Educational Process Mining (EPM): A Learning Analytics Data Set. (2015). UCI Machine Learning Repository.
The final exam consisted of 16 questions, each focused on content from a specific session. Each of the six sessions has two or more related questions.
The questions correspond to trials, and the number of questions answered correctly corresponds to events. Student participation (input data) was recorded by activity, exercise, and session. The numbers of trials and events for each student and session are counts, which can be modeled using Poisson or Negative Binomial Regression. This notebook uses both Poisson and Negative Binomial Regression to model the expected number of correctly answered final exam questions.
Both the original values and the principal components were used with the Poisson and Negative Binomial Regression methods. Using the AIC as the performance metric, the best model across all four combinations was the purely additive model with categorical features sid and actv_grp and numeric features, either the interpolated variables or the principal components:
final_events ~ sid + actv_grp + numeric_features
The fact that the student ID, activity group, and numeric features were important indicates that both the student and the student's behavior as measured by mouse and keyboard activity are required to determine the outcome.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
from patsy import dmatrices
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
CMPINF2120_EPM_FUNC_INCL_Over_Lisa.ipynb includes functions used in this notebook.
%run CMPINF2120_EPM_FUNC_INCL_Over_Lisa.ipynb
inputs_final_sqrt_path = 'https://raw.githubusercontent.com/lisaover/CMPINF2120_project/main/tp_sqrt_inputs_final_df.csv'
final_path = 'https://raw.githubusercontent.com/lisaover/CMPINF2120_project/main/final_df.csv'
pts_path = 'https://raw.githubusercontent.com/lisaover/CMPINF2120_project/main/final_points_lookup.csv'
final_sqrt_init = pd.read_csv(inputs_final_sqrt_path)
final_init = pd.read_csv(final_path)
pts_final_lookup = pd.read_csv(pts_path)
final_sqrt_init.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2444 entries, 0 to 2443 Data columns (total 82 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 sess 2444 non-null int64 1 sid 2444 non-null int64 2 actv_grp 2444 non-null object 3 total_ms_tp000_sqrt 2444 non-null float64 4 mw_tp000_sqrt 2444 non-null float64 5 mwc_tp000_sqrt 2444 non-null float64 6 mcl_tp000_sqrt 2444 non-null float64 7 mcr_tp000_sqrt 2444 non-null float64 8 mm_tp000_sqrt 2444 non-null float64 9 ks_tp000_sqrt 2444 non-null float64 10 total_ms_tp010_sqrt 2444 non-null float64 11 mw_tp010_sqrt 2444 non-null float64 12 mwc_tp010_sqrt 2444 non-null float64 13 mcl_tp010_sqrt 2444 non-null float64 14 mcr_tp010_sqrt 2444 non-null float64 15 mm_tp010_sqrt 2444 non-null float64 16 ks_tp010_sqrt 2444 non-null float64 17 total_ms_tp020_sqrt 2444 non-null float64 18 mw_tp020_sqrt 2444 non-null float64 19 mwc_tp020_sqrt 2444 non-null float64 20 mcl_tp020_sqrt 2444 non-null float64 21 mcr_tp020_sqrt 2444 non-null float64 22 mm_tp020_sqrt 2444 non-null float64 23 ks_tp020_sqrt 2444 non-null float64 24 total_ms_tp030_sqrt 2444 non-null float64 25 mw_tp030_sqrt 2444 non-null float64 26 mwc_tp030_sqrt 2444 non-null float64 27 mcl_tp030_sqrt 2444 non-null float64 28 mcr_tp030_sqrt 2444 non-null float64 29 mm_tp030_sqrt 2444 non-null float64 30 ks_tp030_sqrt 2444 non-null float64 31 total_ms_tp040_sqrt 2444 non-null float64 32 mw_tp040_sqrt 2444 non-null float64 33 mwc_tp040_sqrt 2444 non-null float64 34 mcl_tp040_sqrt 2444 non-null float64 35 mcr_tp040_sqrt 2444 non-null float64 36 mm_tp040_sqrt 2444 non-null float64 37 ks_tp040_sqrt 2444 non-null float64 38 total_ms_tp050_sqrt 2444 non-null float64 39 mw_tp050_sqrt 2444 non-null float64 40 mwc_tp050_sqrt 2444 non-null float64 41 mcl_tp050_sqrt 2444 non-null float64 42 mcr_tp050_sqrt 2444 non-null float64 43 mm_tp050_sqrt 2444 non-null float64 44 ks_tp050_sqrt 2444 non-null float64 45 total_ms_tp060_sqrt 2444 non-null float64 46 
mw_tp060_sqrt 2444 non-null float64 47 mwc_tp060_sqrt 2444 non-null float64 48 mcl_tp060_sqrt 2444 non-null float64 49 mcr_tp060_sqrt 2444 non-null float64 50 mm_tp060_sqrt 2444 non-null float64 51 ks_tp060_sqrt 2444 non-null float64 52 total_ms_tp070_sqrt 2444 non-null float64 53 mw_tp070_sqrt 2444 non-null float64 54 mwc_tp070_sqrt 2444 non-null float64 55 mcl_tp070_sqrt 2444 non-null float64 56 mcr_tp070_sqrt 2444 non-null float64 57 mm_tp070_sqrt 2444 non-null float64 58 ks_tp070_sqrt 2444 non-null float64 59 total_ms_tp080_sqrt 2444 non-null float64 60 mw_tp080_sqrt 2444 non-null float64 61 mwc_tp080_sqrt 2444 non-null float64 62 mcl_tp080_sqrt 2444 non-null float64 63 mcr_tp080_sqrt 2444 non-null float64 64 mm_tp080_sqrt 2444 non-null float64 65 ks_tp080_sqrt 2444 non-null float64 66 total_ms_tp090_sqrt 2444 non-null float64 67 mw_tp090_sqrt 2444 non-null float64 68 mwc_tp090_sqrt 2444 non-null float64 69 mcl_tp090_sqrt 2444 non-null float64 70 mcr_tp090_sqrt 2444 non-null float64 71 mm_tp090_sqrt 2444 non-null float64 72 ks_tp090_sqrt 2444 non-null float64 73 total_ms_tp100_sqrt 2444 non-null float64 74 mw_tp100_sqrt 2444 non-null float64 75 mwc_tp100_sqrt 2444 non-null float64 76 mcl_tp100_sqrt 2444 non-null float64 77 mcr_tp100_sqrt 2444 non-null float64 78 mm_tp100_sqrt 2444 non-null float64 79 ks_tp100_sqrt 2444 non-null float64 80 final_events 2444 non-null float64 81 final_trials 2444 non-null float64 dtypes: float64(79), int64(2), object(1) memory usage: 1.5+ MB
final_sqrt_init.isna().sum()
sess 0
sid 0
actv_grp 0
total_ms_tp000_sqrt 0
mw_tp000_sqrt 0
..
mcr_tp100_sqrt 0
mm_tp100_sqrt 0
ks_tp100_sqrt 0
final_events 0
final_trials 0
Length: 82, dtype: int64
final_init.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   sid          62 non-null     int64
 1   Es_1q1       62 non-null     float64
 2   Es_1q2       62 non-null     float64
 3   Es_2q1       62 non-null     float64
 4   Es_2q2       62 non-null     float64
 5   Es_3q1       62 non-null     float64
 6   Es_3q2       62 non-null     float64
 7   Es_3q3       62 non-null     float64
 8   Es_3q4       62 non-null     float64
 9   Es_3q5       62 non-null     float64
 10  Es_4q1       62 non-null     float64
 11  Es_4q2       62 non-null     float64
 12  Es_5q1       62 non-null     float64
 13  Es_5q2       62 non-null     float64
 14  Es_5q3       62 non-null     float64
 15  Es_6q1       62 non-null     float64
 16  Es_6q2       62 non-null     float64
 17  final_score  62 non-null     float64
dtypes: float64(17), int64(1)
memory usage: 8.8 KB
final_init.isna().sum()
sid            0
Es_1q1         0
Es_1q2         0
Es_2q1         0
Es_2q2         0
Es_3q1         0
Es_3q2         0
Es_3q3         0
Es_3q4         0
Es_3q5         0
Es_4q1         0
Es_4q2         0
Es_5q1         0
Es_5q2         0
Es_5q3         0
Es_6q1         0
Es_6q2         0
final_score    0
dtype: int64
pts_final_lookup.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   question         17 non-null     object
 1   question_points  17 non-null     int64
dtypes: int64(1), object(1)
memory usage: 400.0+ bytes
pts_final_lookup.isna().sum()
question           0
question_points    0
dtype: int64
Melt final_init and create a session variable
final_lf = final_init.melt( id_vars=['sid']).\
rename(columns={"variable": "question", "value": "quest_scr"}).\
copy()
# Raw string avoids an invalid-escape warning; expand=False returns a Series.
final_lf['sess'] = final_lf.question.str.extract(r'(\d)', expand=False)
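The melt-and-extract step can be illustrated on a tiny made-up table. The frame `wide` and its values are assumptions for this example only; the column naming pattern follows final_init.

```python
import pandas as pd

# Tiny made-up wide table: one row per student, one column per exam question.
wide = pd.DataFrame({
    "sid": [1, 2],
    "Es_1q1": [2.0, 1.0],
    "Es_3q2": [0.0, 3.0],
})

# Wide -> long: one row per (student, question) pair.
long = wide.melt(id_vars=["sid"], var_name="question", value_name="quest_scr")

# The first digit in the question label is the session number.
# expand=False returns a Series, so it assigns cleanly to a single column.
long["sess"] = long["question"].str.extract(r"(\d)", expand=False)

print(long)
```

The extracted session values are strings, and a label with no digit (such as final_score) yields a missing value, which is worth keeping in mind before filtering or grouping on sess.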
Merge final_lf with pts_final_lookup, then create a pass/fail variable for each student and question
final_lf_b = pd.merge(final_lf, pts_final_lookup, on='question', how='left')
final_lf_b.head()
| | sid | question | quest_scr | sess | question_points |
|---|---|---|---|---|---|
| 0 | 1 | Es_1q1 | 2.0 | 1 | 2 |
| 1 | 2 | Es_1q1 | 2.0 | 1 | 2 |
| 2 | 4 | Es_1q1 | 2.0 | 1 | 2 |
| 3 | 5 | Es_1q1 | 2.0 | 1 | 2 |
| 4 | 7 | Es_1q1 | 2.0 | 1 | 2 |
final_lf_b['Qpass'] = [1 if i/j >= 0.7 else 0 for (i, j) in zip(final_lf_b['quest_scr'],final_lf_b['question_points'])]
final_lf_b.head()
| | sid | question | quest_scr | sess | question_points | Qpass |
|---|---|---|---|---|---|---|
| 0 | 1 | Es_1q1 | 2.0 | 1 | 2 | 1 |
| 1 | 2 | Es_1q1 | 2.0 | 1 | 2 | 1 |
| 2 | 4 | Es_1q1 | 2.0 | 1 | 2 | 1 |
| 3 | 5 | Es_1q1 | 2.0 | 1 | 2 | 1 |
| 4 | 7 | Es_1q1 | 2.0 | 1 | 2 | 1 |
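The 70% pass rule can also be written without a Python loop, and the resulting flags can be rolled up into per-student, per-session counts. The sketch below uses made-up scores, and it is an assumption that the final_events and final_trials columns were built by an aggregation like this one, since that step is not shown here.

```python
import pandas as pd

# Made-up long-format scores; column names follow final_lf_b.
scores = pd.DataFrame({
    "sid": [1, 1, 1, 2, 2, 2],
    "sess": ["1", "1", "2", "1", "1", "2"],
    "quest_scr": [2.0, 1.0, 3.0, 0.5, 2.0, 1.0],
    "question_points": [2, 2, 3, 2, 2, 3],
})

# Vectorized 70% rule: a question is passed when score / points >= 0.7.
ratio = scores["quest_scr"] / scores["question_points"]
scores["Qpass"] = (ratio >= 0.7).astype(int)

# Per student and session: trials = questions asked, events = questions passed.
counts = (
    scores.groupby(["sid", "sess"], as_index=False)
          .agg(final_trials=("Qpass", "count"), final_events=("Qpass", "sum"))
)
print(counts)
```

Under this reading, events can never exceed trials for any student and session, which gives a quick sanity check on the assembled modeling table.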
final_sqrt_init['sid'] = final_sqrt_init['sid'].astype('object')
final_sqrt_init['sess'] = final_sqrt_init['sess'].astype('object')
final_sqrt_df = final_sqrt_init.copy()
sqrt_vars = get_var_list(final_sqrt_df,['sqrt'])
sqrt_features_df = final_sqrt_df.loc[:, sqrt_vars].copy()
sqrt_features_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2444 entries, 0 to 2443 Data columns (total 77 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 total_ms_tp000_sqrt 2444 non-null float64 1 mw_tp000_sqrt 2444 non-null float64 2 mwc_tp000_sqrt 2444 non-null float64 3 mcl_tp000_sqrt 2444 non-null float64 4 mcr_tp000_sqrt 2444 non-null float64 5 mm_tp000_sqrt 2444 non-null float64 6 ks_tp000_sqrt 2444 non-null float64 7 total_ms_tp010_sqrt 2444 non-null float64 8 mw_tp010_sqrt 2444 non-null float64 9 mwc_tp010_sqrt 2444 non-null float64 10 mcl_tp010_sqrt 2444 non-null float64 11 mcr_tp010_sqrt 2444 non-null float64 12 mm_tp010_sqrt 2444 non-null float64 13 ks_tp010_sqrt 2444 non-null float64 14 total_ms_tp020_sqrt 2444 non-null float64 15 mw_tp020_sqrt 2444 non-null float64 16 mwc_tp020_sqrt 2444 non-null float64 17 mcl_tp020_sqrt 2444 non-null float64 18 mcr_tp020_sqrt 2444 non-null float64 19 mm_tp020_sqrt 2444 non-null float64 20 ks_tp020_sqrt 2444 non-null float64 21 total_ms_tp030_sqrt 2444 non-null float64 22 mw_tp030_sqrt 2444 non-null float64 23 mwc_tp030_sqrt 2444 non-null float64 24 mcl_tp030_sqrt 2444 non-null float64 25 mcr_tp030_sqrt 2444 non-null float64 26 mm_tp030_sqrt 2444 non-null float64 27 ks_tp030_sqrt 2444 non-null float64 28 total_ms_tp040_sqrt 2444 non-null float64 29 mw_tp040_sqrt 2444 non-null float64 30 mwc_tp040_sqrt 2444 non-null float64 31 mcl_tp040_sqrt 2444 non-null float64 32 mcr_tp040_sqrt 2444 non-null float64 33 mm_tp040_sqrt 2444 non-null float64 34 ks_tp040_sqrt 2444 non-null float64 35 total_ms_tp050_sqrt 2444 non-null float64 36 mw_tp050_sqrt 2444 non-null float64 37 mwc_tp050_sqrt 2444 non-null float64 38 mcl_tp050_sqrt 2444 non-null float64 39 mcr_tp050_sqrt 2444 non-null float64 40 mm_tp050_sqrt 2444 non-null float64 41 ks_tp050_sqrt 2444 non-null float64 42 total_ms_tp060_sqrt 2444 non-null float64 43 mw_tp060_sqrt 2444 non-null float64 44 mwc_tp060_sqrt 2444 non-null float64 45 mcl_tp060_sqrt 2444 
non-null float64 46 mcr_tp060_sqrt 2444 non-null float64 47 mm_tp060_sqrt 2444 non-null float64 48 ks_tp060_sqrt 2444 non-null float64 49 total_ms_tp070_sqrt 2444 non-null float64 50 mw_tp070_sqrt 2444 non-null float64 51 mwc_tp070_sqrt 2444 non-null float64 52 mcl_tp070_sqrt 2444 non-null float64 53 mcr_tp070_sqrt 2444 non-null float64 54 mm_tp070_sqrt 2444 non-null float64 55 ks_tp070_sqrt 2444 non-null float64 56 total_ms_tp080_sqrt 2444 non-null float64 57 mw_tp080_sqrt 2444 non-null float64 58 mwc_tp080_sqrt 2444 non-null float64 59 mcl_tp080_sqrt 2444 non-null float64 60 mcr_tp080_sqrt 2444 non-null float64 61 mm_tp080_sqrt 2444 non-null float64 62 ks_tp080_sqrt 2444 non-null float64 63 total_ms_tp090_sqrt 2444 non-null float64 64 mw_tp090_sqrt 2444 non-null float64 65 mwc_tp090_sqrt 2444 non-null float64 66 mcl_tp090_sqrt 2444 non-null float64 67 mcr_tp090_sqrt 2444 non-null float64 68 mm_tp090_sqrt 2444 non-null float64 69 ks_tp090_sqrt 2444 non-null float64 70 total_ms_tp100_sqrt 2444 non-null float64 71 mw_tp100_sqrt 2444 non-null float64 72 mwc_tp100_sqrt 2444 non-null float64 73 mcl_tp100_sqrt 2444 non-null float64 74 mcr_tp100_sqrt 2444 non-null float64 75 mm_tp100_sqrt 2444 non-null float64 76 ks_tp100_sqrt 2444 non-null float64 dtypes: float64(77) memory usage: 1.4 MB
sqrt_feature_names = sqrt_features_df.columns
len(sqrt_feature_names)
77
final_sqrt_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2444 entries, 0 to 2443 Data columns (total 82 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 sess 2444 non-null object 1 sid 2444 non-null object 2 actv_grp 2444 non-null object 3 total_ms_tp000_sqrt 2444 non-null float64 4 mw_tp000_sqrt 2444 non-null float64 5 mwc_tp000_sqrt 2444 non-null float64 6 mcl_tp000_sqrt 2444 non-null float64 7 mcr_tp000_sqrt 2444 non-null float64 8 mm_tp000_sqrt 2444 non-null float64 9 ks_tp000_sqrt 2444 non-null float64 10 total_ms_tp010_sqrt 2444 non-null float64 11 mw_tp010_sqrt 2444 non-null float64 12 mwc_tp010_sqrt 2444 non-null float64 13 mcl_tp010_sqrt 2444 non-null float64 14 mcr_tp010_sqrt 2444 non-null float64 15 mm_tp010_sqrt 2444 non-null float64 16 ks_tp010_sqrt 2444 non-null float64 17 total_ms_tp020_sqrt 2444 non-null float64 18 mw_tp020_sqrt 2444 non-null float64 19 mwc_tp020_sqrt 2444 non-null float64 20 mcl_tp020_sqrt 2444 non-null float64 21 mcr_tp020_sqrt 2444 non-null float64 22 mm_tp020_sqrt 2444 non-null float64 23 ks_tp020_sqrt 2444 non-null float64 24 total_ms_tp030_sqrt 2444 non-null float64 25 mw_tp030_sqrt 2444 non-null float64 26 mwc_tp030_sqrt 2444 non-null float64 27 mcl_tp030_sqrt 2444 non-null float64 28 mcr_tp030_sqrt 2444 non-null float64 29 mm_tp030_sqrt 2444 non-null float64 30 ks_tp030_sqrt 2444 non-null float64 31 total_ms_tp040_sqrt 2444 non-null float64 32 mw_tp040_sqrt 2444 non-null float64 33 mwc_tp040_sqrt 2444 non-null float64 34 mcl_tp040_sqrt 2444 non-null float64 35 mcr_tp040_sqrt 2444 non-null float64 36 mm_tp040_sqrt 2444 non-null float64 37 ks_tp040_sqrt 2444 non-null float64 38 total_ms_tp050_sqrt 2444 non-null float64 39 mw_tp050_sqrt 2444 non-null float64 40 mwc_tp050_sqrt 2444 non-null float64 41 mcl_tp050_sqrt 2444 non-null float64 42 mcr_tp050_sqrt 2444 non-null float64 43 mm_tp050_sqrt 2444 non-null float64 44 ks_tp050_sqrt 2444 non-null float64 45 total_ms_tp060_sqrt 2444 non-null float64 46 
mw_tp060_sqrt 2444 non-null float64 47 mwc_tp060_sqrt 2444 non-null float64 48 mcl_tp060_sqrt 2444 non-null float64 49 mcr_tp060_sqrt 2444 non-null float64 50 mm_tp060_sqrt 2444 non-null float64 51 ks_tp060_sqrt 2444 non-null float64 52 total_ms_tp070_sqrt 2444 non-null float64 53 mw_tp070_sqrt 2444 non-null float64 54 mwc_tp070_sqrt 2444 non-null float64 55 mcl_tp070_sqrt 2444 non-null float64 56 mcr_tp070_sqrt 2444 non-null float64 57 mm_tp070_sqrt 2444 non-null float64 58 ks_tp070_sqrt 2444 non-null float64 59 total_ms_tp080_sqrt 2444 non-null float64 60 mw_tp080_sqrt 2444 non-null float64 61 mwc_tp080_sqrt 2444 non-null float64 62 mcl_tp080_sqrt 2444 non-null float64 63 mcr_tp080_sqrt 2444 non-null float64 64 mm_tp080_sqrt 2444 non-null float64 65 ks_tp080_sqrt 2444 non-null float64 66 total_ms_tp090_sqrt 2444 non-null float64 67 mw_tp090_sqrt 2444 non-null float64 68 mwc_tp090_sqrt 2444 non-null float64 69 mcl_tp090_sqrt 2444 non-null float64 70 mcr_tp090_sqrt 2444 non-null float64 71 mm_tp090_sqrt 2444 non-null float64 72 ks_tp090_sqrt 2444 non-null float64 73 total_ms_tp100_sqrt 2444 non-null float64 74 mw_tp100_sqrt 2444 non-null float64 75 mwc_tp100_sqrt 2444 non-null float64 76 mcl_tp100_sqrt 2444 non-null float64 77 mcr_tp100_sqrt 2444 non-null float64 78 mm_tp100_sqrt 2444 non-null float64 79 ks_tp100_sqrt 2444 non-null float64 80 final_events 2444 non-null float64 81 final_trials 2444 non-null float64 dtypes: float64(79), object(3) memory usage: 1.5+ MB
sns.displot( data = final_lf_b.loc[final_lf_b.Qpass==1], x='sess', col='sid', col_wrap=2,
hue='sess', kind='hist', binwidth = 1, facet_kws={'sharey':False, 'sharex':False})
plt.show()
final_sqrt_lf = final_sqrt_df.melt(id_vars=['sess','sid','actv_grp','final_events','final_trials'], ignore_index=True).copy()
final_sqrt_lf.head()
| | sess | sid | actv_grp | final_events | final_trials | variable | value |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Aulaweb | 2.0 | 2.0 | total_ms_tp000_sqrt | 89.442719 |
| 1 | 1 | 1 | Blank | 2.0 | 2.0 | total_ms_tp000_sqrt | 89.442719 |
| 2 | 1 | 1 | Deeds | 2.0 | 2.0 | total_ms_tp000_sqrt | 202.484567 |
| 3 | 1 | 1 | Diagram | 2.0 | 2.0 | total_ms_tp000_sqrt | 939.148551 |
| 4 | 1 | 1 | Other | 2.0 | 2.0 | total_ms_tp000_sqrt | 31.622777 |
final_sqrt_lf.actv_grp.unique()
array(['Aulaweb', 'Blank', 'Deeds', 'Diagram', 'Other', 'Properties',
'Study', 'TextEditor', 'Study_Materials', 'FSM_Related', 'FSM'],
dtype=object)
actv_subgrp_1 = ['Aulaweb','Deeds','Diagram','TextEditor','FSM_Related','FSM']
actv_subgrp_2 = ['Blank','Other','Properties','Study','Study_Materials']
sns.relplot( data = final_sqrt_lf.loc[final_sqrt_lf.actv_grp.isin(actv_subgrp_1) & (final_sqrt_lf['sess']==1)],
x='final_events', y='value',
col='actv_grp', row='variable', hue='actv_grp',
facet_kws={'sharey': False, 'sharex': False})
plt.show()
sns.relplot( data = final_sqrt_lf.loc[final_sqrt_lf.actv_grp.isin(actv_subgrp_1) & (final_sqrt_lf['sess']==2)],
x='final_events', y='value',
col='actv_grp', row='variable', hue='actv_grp',
facet_kws={'sharey': False, 'sharex': False})
plt.show()
sns.relplot( data = final_sqrt_lf.loc[final_sqrt_lf.actv_grp.isin(actv_subgrp_1) & (final_sqrt_lf['sess']==3)],
x='final_events', y='value',
col='actv_grp', row='variable', hue='actv_grp',
facet_kws={'sharey': False, 'sharex': False})
plt.show()
sns.relplot( data = final_sqrt_lf.loc[final_sqrt_lf.actv_grp.isin(actv_subgrp_1) & (final_sqrt_lf['sess']==4)],
x='final_events', y='value',
col='actv_grp', row='variable', hue='actv_grp',
facet_kws={'sharey': False, 'sharex': False})
plt.show()
sns.relplot( data = final_sqrt_lf.loc[final_sqrt_lf.actv_grp.isin(actv_subgrp_1) & (final_sqrt_lf['sess']==5)],
x='final_events', y='value',
col='actv_grp', row='variable', hue='actv_grp',
facet_kws={'sharey': False, 'sharex': False})
plt.show()
sns.relplot( data = final_sqrt_lf.loc[final_sqrt_lf.actv_grp.isin(actv_subgrp_1) & (final_sqrt_lf['sess']==6)],
x='final_events', y='value',
col='actv_grp', row='variable', hue='actv_grp',
facet_kws={'sharey': False, 'sharex': False})
plt.show()
final_sqrt_df.columns
Index(['sess', 'sid', 'actv_grp', 'total_ms_tp000_sqrt', 'mw_tp000_sqrt',
'mwc_tp000_sqrt', 'mcl_tp000_sqrt', 'mcr_tp000_sqrt', 'mm_tp000_sqrt',
'ks_tp000_sqrt', 'total_ms_tp010_sqrt', 'mw_tp010_sqrt',
'mwc_tp010_sqrt', 'mcl_tp010_sqrt', 'mcr_tp010_sqrt', 'mm_tp010_sqrt',
'ks_tp010_sqrt', 'total_ms_tp020_sqrt', 'mw_tp020_sqrt',
'mwc_tp020_sqrt', 'mcl_tp020_sqrt', 'mcr_tp020_sqrt', 'mm_tp020_sqrt',
'ks_tp020_sqrt', 'total_ms_tp030_sqrt', 'mw_tp030_sqrt',
'mwc_tp030_sqrt', 'mcl_tp030_sqrt', 'mcr_tp030_sqrt', 'mm_tp030_sqrt',
'ks_tp030_sqrt', 'total_ms_tp040_sqrt', 'mw_tp040_sqrt',
'mwc_tp040_sqrt', 'mcl_tp040_sqrt', 'mcr_tp040_sqrt', 'mm_tp040_sqrt',
'ks_tp040_sqrt', 'total_ms_tp050_sqrt', 'mw_tp050_sqrt',
'mwc_tp050_sqrt', 'mcl_tp050_sqrt', 'mcr_tp050_sqrt', 'mm_tp050_sqrt',
'ks_tp050_sqrt', 'total_ms_tp060_sqrt', 'mw_tp060_sqrt',
'mwc_tp060_sqrt', 'mcl_tp060_sqrt', 'mcr_tp060_sqrt', 'mm_tp060_sqrt',
'ks_tp060_sqrt', 'total_ms_tp070_sqrt', 'mw_tp070_sqrt',
'mwc_tp070_sqrt', 'mcl_tp070_sqrt', 'mcr_tp070_sqrt', 'mm_tp070_sqrt',
'ks_tp070_sqrt', 'total_ms_tp080_sqrt', 'mw_tp080_sqrt',
'mwc_tp080_sqrt', 'mcl_tp080_sqrt', 'mcr_tp080_sqrt', 'mm_tp080_sqrt',
'ks_tp080_sqrt', 'total_ms_tp090_sqrt', 'mw_tp090_sqrt',
'mwc_tp090_sqrt', 'mcl_tp090_sqrt', 'mcr_tp090_sqrt', 'mm_tp090_sqrt',
'ks_tp090_sqrt', 'total_ms_tp100_sqrt', 'mw_tp100_sqrt',
'mwc_tp100_sqrt', 'mcl_tp100_sqrt', 'mcr_tp100_sqrt', 'mm_tp100_sqrt',
'ks_tp100_sqrt', 'final_events', 'final_trials'],
dtype='object')
totl_vars = get_var_list_b(final_sqrt_df,['total'])
mw_vars = get_var_list_b(final_sqrt_df,['mw_'])
mwc_vars = get_var_list_b(final_sqrt_df,['mwc'])
mcl_vars = get_var_list_b(final_sqrt_df,['mcl'])
mcr_vars = get_var_list_b(final_sqrt_df,['mcr'])
mm_vars = get_var_list_b(final_sqrt_df,['mm'])
ks_vars = get_var_list_b(final_sqrt_df,['ks'])
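The helper get_var_list_b is defined in the included functions notebook and is not shown here. Assuming it selects columns by substring match, a hypothetical stand-in (the name `select_columns_containing` is invented for this sketch) shows why the pattern for mouse-wheel columns carries a trailing underscore.

```python
import pandas as pd

def select_columns_containing(df, substrings):
    """Return column names that contain any of the given substrings.

    Hypothetical stand-in for get_var_list_b; the real helper may differ.
    """
    return [c for c in df.columns if any(s in c for s in substrings)]

# Empty frame with a few column names in the notebook's naming pattern.
demo = pd.DataFrame(columns=[
    "sid", "mw_tp000_sqrt", "mwc_tp000_sqrt", "mm_tp000_sqrt", "ks_tp000_sqrt",
])

# The trailing underscore keeps mwc_* out of the mw_* group.
print(select_columns_containing(demo, ["mw_"]))
# Without it, both mw_* and mwc_* columns match.
print(select_columns_containing(demo, ["mw"]))
print(select_columns_containing(demo, ["ks"]))
```

Checking the length of each returned list against the expected 11 time points per measure is a cheap way to catch a pattern that matches too much or nothing at all.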
sns.catplot(data = final_sqrt_df, kind='box', aspect=3.5)
plt.show()
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(data = final_sqrt_df[sqrt_feature_names].corr(),
vmin=-1, vmax=1, center = 0,
cmap='coolwarm',
ax=ax)
plt.show()
Xtimepoints = StandardScaler().fit_transform( sqrt_features_df )
Xtimepoints.shape
(2444, 77)
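As a small check of what standardization does, the sketch below scales made-up features and then runs PCA on the result. The data are synthetic, and feeding the standardized matrix to PCA is an assumption about how the principal components mentioned in the introduction were obtained.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Made-up features on very different scales: 200 rows x 4 columns.
X = rng.normal(loc=[0, 50, 500, 5], scale=[1, 10, 100, 0.5], size=(200, 4))
# Make the last column nearly a multiple of the first, so PCA has something to find.
X[:, 3] = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

X_std = StandardScaler().fit_transform(X)

# After standardizing, each column has mean ~0 and population std ~1.
print(X_std.mean(axis=0).round(6))
print(X_std.std(axis=0).round(6))

# PCA on the standardized columns; ratios sum to 1 when all components are kept.
pca = PCA().fit(X_std)
print(pca.explained_variance_ratio_.round(3))
```

Standardizing first keeps a feature measured in large units from dominating the components purely because of its scale.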
sns.catplot(data = pd.DataFrame(Xtimepoints, columns=sqrt_feature_names), kind='box', aspect=3.5)
plt.show()
final_sqrt_std_df = pd.concat([final_sqrt_df.loc[:,['sess','sid','actv_grp','final_events','final_trials']].copy(), pd.DataFrame(Xtimepoints, columns=sqrt_feature_names).copy()], axis=1)
final_sqrt_std_df.head()
| | sess | sid | actv_grp | final_events | final_trials | total_ms_tp000_sqrt | mw_tp000_sqrt | mwc_tp000_sqrt | mcl_tp000_sqrt | mcr_tp000_sqrt | ... | mcr_tp090_sqrt | mm_tp090_sqrt | ks_tp090_sqrt | total_ms_tp100_sqrt | mw_tp100_sqrt | mwc_tp100_sqrt | mcl_tp100_sqrt | mcr_tp100_sqrt | mm_tp100_sqrt | ks_tp100_sqrt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Aulaweb | 2.0 | 2.0 | -0.790076 | -0.521312 | -0.182566 | -0.764020 | -0.612717 | ... | 1.005451 | 0.955513 | -0.736663 | 0.045447 | -0.222258 | -0.340979 | 0.889260 | 1.185153 | 1.194975 | -0.717268 |
| 1 | 1 | 1 | Blank | 2.0 | 2.0 | -0.790076 | -0.521312 | -0.182566 | -0.764020 | -0.612717 | ... | 1.027941 | 0.992772 | -0.733079 | 0.042272 | -0.224584 | -0.340979 | 0.885189 | 1.185153 | 1.191882 | -0.717268 |
| 2 | 1 | 1 | Deeds | 2.0 | 2.0 | -0.594451 | -0.214631 | -0.182566 | -0.638176 | -0.612717 | ... | 1.007958 | 0.969835 | -0.737460 | -0.099400 | -0.248136 | -0.340979 | 0.763366 | 1.129168 | 1.095563 | -0.783181 |
| 3 | 1 | 1 | Diagram | 2.0 | 2.0 | 0.680382 | 0.175782 | -0.182566 | 0.907550 | 1.709392 | ... | 0.997917 | 0.924780 | -0.755483 | 0.010015 | -0.248136 | -0.340979 | 0.845303 | 1.174033 | 1.161415 | -0.783181 |
| 4 | 1 | 1 | Other | 2.0 | 2.0 | -0.890136 | -0.521312 | -0.182566 | -0.935924 | -0.612717 | ... | 1.027941 | 1.034190 | -0.721576 | 0.042272 | -0.224584 | -0.340979 | 0.884171 | 1.185153 | 1.191564 | -0.717268 |
5 rows × 82 columns
# Join the feature names with ' + ' to form the right-hand side of a formula.
num_features_str = ' + '.join(sqrt_feature_names)
num_features_str
'total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt'
descriptive_formulas = ['final_events ~ sid'
,'final_events ~ sid + actv_grp'
,'final_events ~ sid + actv_grp + ' + num_features_str
,'final_events ~ sid * (' + num_features_str + ')'
,'final_events ~ sid * (actv_grp + ' + num_features_str + ')'
]
predictive_formulas = ['final_events ~ ' + num_features_str
,'final_events ~ (' + num_features_str + ')**2'
,'final_events ~ actv_grp + ' + num_features_str
,'final_events ~ actv_grp * (' + num_features_str + ')'
,'final_events ~ actv_grp + (' + num_features_str + ')**2'
,'final_events ~ actv_grp * (' + num_features_str + ')**2'
]
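The formula operators used in these lists expand into very different numbers of model terms. The sketch below uses patsy directly on a made-up frame (`toy`, `g`, `x1` to `x3` are illustrative names) to count the design-matrix columns each right-hand side produces.

```python
import pandas as pd
from patsy import dmatrix

# Made-up data: one two-level categorical and three numeric features.
toy = pd.DataFrame({
    "g": ["a", "b", "a", "b", "a", "b"],
    "x1": [0.1, 0.4, 0.2, 0.9, 0.5, 0.3],
    "x2": [1.0, 0.0, 2.0, 1.5, 0.7, 0.2],
    "x3": [3.0, 1.0, 0.5, 2.5, 1.2, 0.8],
})

rhs_list = [
    "x1 + x2 + x3",         # additive: main effects only
    "(x1 + x2 + x3)**2",    # main effects plus all pairwise interactions
    "g * (x1 + x2 + x3)",   # main effects plus g-by-feature interactions
]

# Count the columns (including the intercept) that each formula generates.
n_cols = {}
for rhs in rhs_list:
    cols = dmatrix(rhs, data=toy).design_info.column_names
    n_cols[rhs] = len(cols)
    print(f"{rhs:20s} -> {len(cols)} columns: {cols}")
```

With 77 numeric features, the `**2` form produces 77 main effects and 2,926 pairwise interactions, which together with the intercept is more columns than the 2,444 rows in the data.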
test_formula_list = descriptive_formulas + predictive_formulas
test_formula_list
['final_events ~ sid', 'final_events ~ sid + actv_grp', 'final_events ~ sid + actv_grp + total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt', 'final_events ~ sid * (total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + 
mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt)', 'final_events ~ sid * (actv_grp + total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + 
mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt)', 'final_events ~ total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt', 'final_events ~ (total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt 
+ total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt)**2', 'final_events ~ actv_grp + total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + 
mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt', 'final_events ~ actv_grp * (total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt)', 'final_events ~ actv_grp + (total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + 
mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt + mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt)**2', 'final_events ~ actv_grp * (total_ms_tp000_sqrt + mw_tp000_sqrt + mwc_tp000_sqrt + mcl_tp000_sqrt + mcr_tp000_sqrt + mm_tp000_sqrt + ks_tp000_sqrt + total_ms_tp010_sqrt + mw_tp010_sqrt + mwc_tp010_sqrt + mcl_tp010_sqrt + mcr_tp010_sqrt + mm_tp010_sqrt + ks_tp010_sqrt + total_ms_tp020_sqrt + mw_tp020_sqrt + mwc_tp020_sqrt + mcl_tp020_sqrt + mcr_tp020_sqrt + mm_tp020_sqrt + ks_tp020_sqrt + total_ms_tp030_sqrt + mw_tp030_sqrt + mwc_tp030_sqrt + mcl_tp030_sqrt + mcr_tp030_sqrt + mm_tp030_sqrt + ks_tp030_sqrt + total_ms_tp040_sqrt + mw_tp040_sqrt + mwc_tp040_sqrt + mcl_tp040_sqrt + mcr_tp040_sqrt + mm_tp040_sqrt + ks_tp040_sqrt + total_ms_tp050_sqrt + mw_tp050_sqrt + mwc_tp050_sqrt + mcl_tp050_sqrt + mcr_tp050_sqrt + mm_tp050_sqrt + ks_tp050_sqrt + total_ms_tp060_sqrt + mw_tp060_sqrt + mwc_tp060_sqrt + mcl_tp060_sqrt + mcr_tp060_sqrt + mm_tp060_sqrt + ks_tp060_sqrt + total_ms_tp070_sqrt + mw_tp070_sqrt + mwc_tp070_sqrt + mcl_tp070_sqrt + mcr_tp070_sqrt + mm_tp070_sqrt + ks_tp070_sqrt + total_ms_tp080_sqrt + mw_tp080_sqrt + mwc_tp080_sqrt 
+ mcl_tp080_sqrt + mcr_tp080_sqrt + mm_tp080_sqrt + ks_tp080_sqrt + total_ms_tp090_sqrt + mw_tp090_sqrt + mwc_tp090_sqrt + mcl_tp090_sqrt + mcr_tp090_sqrt + mm_tp090_sqrt + ks_tp090_sqrt + total_ms_tp100_sqrt + mw_tp100_sqrt + mwc_tp100_sqrt + mcl_tp100_sqrt + mcr_tp100_sqrt + mm_tp100_sqrt + ks_tp100_sqrt)**2']
sk_list = make_dmat(final_sqrt_std_df, test_formula_list)
model_dim = make_dim_df(final_sqrt_std_df, sk_list, test_formula_list)
model_dim
| | model name | dimensions | number of obs | dim < obs |
|---|---|---|---|---|
| 0 | 0 | 62 | 2444 | Yes |
| 1 | 1 | 72 | 2444 | Yes |
| 2 | 2 | 149 | 2444 | Yes |
| 3 | 3 | 4836 | 2444 | No |
| 4 | 4 | 5456 | 2444 | No |
| 5 | 5 | 78 | 2444 | Yes |
| 6 | 6 | 3004 | 2444 | No |
| 7 | 7 | 88 | 2444 | Yes |
| 8 | 8 | 858 | 2444 | Yes |
| 9 | 9 | 3014 | 2444 | No |
| 10 | 10 | 33044 | 2444 | No |
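The dimension counts in this table can be reproduced by hand from the structure of each formula. The following is a minimal sketch, assuming treatment-coded categoricals with 61 dummy columns for sid and 10 for actv_grp (inferred from the 62- and 72-column models) and 77 numeric features. The helper names `additive`, `crossed`, and `pairwise` are hypothetical and are not part of the notebook.

```python
from math import comb

N_OBS = 2444   # rows in final_sqrt_std_df, from the table
SID = 61       # assumed dummy columns for sid (62 students, treatment-coded)
ACTV = 10      # assumed dummy columns for actv_grp (11 groups)
NUM = 77       # numeric features in num_features_str


def additive(*blocks):
    """Intercept plus one column per main-effect term."""
    return 1 + sum(blocks)


def crossed(factor, *blocks):
    """factor * (blocks): main effects plus factor-by-block interactions."""
    inner = sum(blocks)
    return 1 + factor + inner + factor * inner


def pairwise(n):
    """(x1 + ... + xn)**2: intercept, n main effects, all two-way interactions."""
    return 1 + n + comb(n, 2)


dims = [
    additive(SID),                 # sid
    additive(SID, ACTV),           # sid + actv_grp
    additive(SID, ACTV, NUM),      # sid + actv_grp + numeric
    crossed(SID, NUM),             # sid * (numeric)
    crossed(SID, ACTV, NUM),       # sid * (actv_grp + numeric)
    additive(NUM),                 # numeric
    pairwise(NUM),                 # (numeric)**2
    additive(ACTV, NUM),           # actv_grp + numeric
    crossed(ACTV, NUM),            # actv_grp * (numeric)
    ACTV + pairwise(NUM),          # actv_grp + (numeric)**2
    (1 + ACTV) * pairwise(NUM),    # actv_grp * (numeric)**2
]

for i, d in enumerate(dims):
    print(i, d, "Yes" if d < N_OBS else "No")
```

Models 3, 4, 6, 9, and 10 have more columns than the 2444 observations, so their coefficients cannot all be estimated from this data.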
adjust_desc_formulas = ['final_events ~ sid'
,'final_events ~ sid + actv_grp'
,'final_events ~ sid + actv_grp + ' + num_features_str
]
adjust_pred_formulas = ['final_events ~ ' + num_features_str
,'final_events ~ actv_grp + ' + num_features_str
#,'final_events ~ actv_grp * (' + num_features_str + ')'
]
formula_list = adjust_desc_formulas + adjust_pred_formulas
model_list = []
for a_formula in formula_list:
    # Fit each Poisson model with the Newton conjugate gradient optimizer.
    # Alternative `method` values left for reference:
    # 'bfgs', 'lbfgs', 'powell', 'newton', 'cg', 'basinhopping'
    model_list.append(smf.poisson(formula=a_formula, data=final_sqrt_std_df).fit(method='ncg'))
Optimization terminated successfully.
Current function value: 1.543124
Iterations: 14
Function evaluations: 15
Gradient evaluations: 15
Hessian evaluations: 14
Optimization terminated successfully.
Current function value: 1.525616
Iterations: 14
Function evaluations: 15
Gradient evaluations: 15
Hessian evaluations: 14
Optimization terminated successfully.
Current function value: 1.261055
Iterations: 25
Function evaluations: 26
Gradient evaluations: 26
Hessian evaluations: 25
Optimization terminated successfully.
Current function value: 1.429776
Iterations: 14
Function evaluations: 15
Gradient evaluations: 15
Hessian evaluations: 14
Optimization terminated successfully.
Current function value: 1.422724
Iterations: 14
Function evaluations: 15
Gradient evaluations: 15
Hessian evaluations: 14
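The "Current function value" lines are easier to compare once converted to log-likelihoods. The sketch below assumes the printed value is the negative log-likelihood divided by the number of observations, which is consistent with the summary for the third model (Log-Likelihood -3082.0, LL-Null -4195.6, Pseudo R-squ. 0.2654). The variable names are illustrative only.

```python
n_obs = 2444

# "Current function value" printed for each of the five fits, in order
func_values = [1.543124, 1.525616, 1.261055, 1.429776, 1.422724]

# Assumption: printed value = -log-likelihood / n_obs
log_liks = [-v * n_obs for v in func_values]

for i, ll in enumerate(log_liks):
    print(f"mod{i:02d}: log-likelihood = {ll:.1f}")

# McFadden pseudo R-squared for the third model, using LL-Null from its summary
ll_null = -4195.6
pseudo_r2 = 1 - log_liks[2] / ll_null
print(f"mod02 pseudo R-squared = {pseudo_r2:.4f}")
```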
model_results = pd.DataFrame({'model_name': ['mod00','mod01','mod02','mod03','mod04'],
                              'AIC': [mod.aic for mod in model_list],
                              'BIC': [mod.bic for mod in model_list],
                              'Prsquared': [mod.prsquared for mod in model_list]})
sns.relplot(data = model_results.melt(id_vars=['model_name']),
            x='model_name',
            y='value',
            col='variable',
            col_wrap=2,
            facet_kws = {'sharey': False})
plt.show()
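Since the plot itself is not reproduced here, the AIC and BIC ranking can be checked by hand. This is a rough sketch, assuming each printed function value is the negative log-likelihood per observation and that the parameter count equals the design-matrix dimension reported for the same formula. The numbers are recomputed from printed output, so they are a cross-check on `mod.aic` and `mod.bic`, not a replacement for them.

```python
from math import log

n_obs = 2444
names = ['mod00', 'mod01', 'mod02', 'mod03', 'mod04']
n_params = [62, 72, 149, 78, 88]   # design-matrix columns for each formula
func_values = [1.543124, 1.525616, 1.261055, 1.429776, 1.422724]


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion: k ln(n) - 2 log L."""
    return k * log(n) - 2 * log_lik


rows = []
for name, k, value in zip(names, n_params, func_values):
    log_lik = -value * n_obs   # assumed relation to the printed value
    rows.append((name, aic(log_lik, k), bic(log_lik, k, n_obs)))

for name, a, b in rows:
    print(f"{name}: AIC = {a:.1f}, BIC = {b:.1f}")

best_aic = min(rows, key=lambda r: r[1])[0]
best_bic = min(rows, key=lambda r: r[2])[0]
print("lowest AIC:", best_aic, "| lowest BIC:", best_bic)
```

Under these assumptions the additive model with sid, actv_grp, and the numeric features has the lowest value on both criteria, despite carrying the most parameters.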
print(model_list[2].summary())
Poisson Regression Results
==============================================================================
Dep. Variable: final_events No. Observations: 2444
Model: Poisson Df Residuals: 2295
Method: MLE Df Model: 148
Date: Thu, 27 Apr 2023 Pseudo R-squ.: 0.2654
Time: 07:52:47 Log-Likelihood: -3082.0
converged: True LL-Null: -4195.6
Covariance Type: nonrobust LLR p-value: 0.000
===============================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 0.7877 0.132 5.972 0.000 0.529 1.046
sid[T.2] -0.7525 0.181 -4.147 0.000 -1.108 -0.397
sid[T.4] -1.4534 0.216 -6.716 0.000 -1.878 -1.029
sid[T.5] -0.4380 0.172 -2.552 0.011 -0.774 -0.102
sid[T.7] -0.4922 0.166 -2.957 0.003 -0.819 -0.166
sid[T.8] -1.4308 0.243 -5.885 0.000 -1.907 -0.954
sid[T.9] -1.0395 0.194 -5.363 0.000 -1.419 -0.660
sid[T.11] -0.0999 0.167 -0.599 0.549 -0.427 0.227
sid[T.12] -0.7656 0.196 -3.915 0.000 -1.149 -0.382
sid[T.14] -0.2179 0.161 -1.352 0.176 -0.534 0.098
sid[T.19] -0.5119 0.187 -2.735 0.006 -0.879 -0.145
sid[T.20] 0.1848 0.155 1.190 0.234 -0.120 0.489
sid[T.22] -1.5446 0.291 -5.304 0.000 -2.115 -0.974
sid[T.24] -0.9950 0.181 -5.486 0.000 -1.350 -0.640
sid[T.25] -0.5366 0.198 -2.715 0.007 -0.924 -0.149
sid[T.30] -0.4144 0.175 -2.362 0.018 -0.758 -0.071
sid[T.33] -14.2981 344.152 -0.042 0.967 -688.824 660.228
sid[T.34] -0.8755 0.186 -4.704 0.000 -1.240 -0.511
sid[T.37] -1.2331 0.259 -4.763 0.000 -1.741 -0.726
sid[T.38] -0.8374 0.184 -4.562 0.000 -1.197 -0.478
sid[T.39] -0.6275 0.194 -3.242 0.001 -1.007 -0.248
sid[T.42] -1.4618 0.215 -6.809 0.000 -1.883 -1.041
sid[T.44] 0.4767 0.187 2.555 0.011 0.111 0.842
sid[T.45] 0.0498 0.194 0.257 0.797 -0.330 0.429
sid[T.46] -1.3844 0.285 -4.859 0.000 -1.943 -0.826
sid[T.47] -1.1223 0.205 -5.471 0.000 -1.524 -0.720
sid[T.49] -1.2151 0.193 -6.299 0.000 -1.593 -0.837
sid[T.51] -1.4011 0.223 -6.297 0.000 -1.837 -0.965
sid[T.52] -1.4096 0.213 -6.626 0.000 -1.827 -0.993
sid[T.54] -0.7311 0.183 -3.998 0.000 -1.089 -0.373
sid[T.55] 0.0382 0.175 0.219 0.827 -0.304 0.381
sid[T.56] -0.1387 0.165 -0.839 0.401 -0.463 0.185
sid[T.57] -14.6875 344.152 -0.043 0.966 -689.213 659.838
sid[T.58] -0.7754 0.439 -1.766 0.077 -1.636 0.085
sid[T.59] -1.2824 0.222 -5.780 0.000 -1.717 -0.848
sid[T.60] -15.7662 344.152 -0.046 0.963 -690.292 658.760
sid[T.61] -0.5793 0.187 -3.096 0.002 -0.946 -0.213
sid[T.62] -0.7993 0.283 -2.820 0.005 -1.355 -0.244
sid[T.64] -15.7767 344.152 -0.046 0.963 -690.302 658.749
sid[T.67] -0.2502 0.188 -1.333 0.182 -0.618 0.118
sid[T.68] 0.0169 0.158 0.107 0.915 -0.293 0.327
sid[T.69] -0.7510 0.223 -3.374 0.001 -1.187 -0.315
sid[T.70] -0.6597 0.186 -3.552 0.000 -1.024 -0.296
sid[T.71] -0.4225 0.202 -2.090 0.037 -0.819 -0.026
sid[T.73] -1.0531 0.208 -5.072 0.000 -1.460 -0.646
sid[T.75] 0.0371 0.171 0.217 0.828 -0.298 0.372
sid[T.77] -0.4970 0.438 -1.135 0.256 -1.355 0.361
sid[T.79] -0.7506 0.180 -4.179 0.000 -1.103 -0.399
sid[T.80] -0.8105 0.194 -4.182 0.000 -1.190 -0.431
sid[T.82] -1.8675 0.222 -8.394 0.000 -2.304 -1.431
sid[T.83] -1.6625 0.217 -7.659 0.000 -2.088 -1.237
sid[T.87] -0.4249 0.168 -2.532 0.011 -0.754 -0.096
sid[T.91] -1.1406 0.189 -6.044 0.000 -1.510 -0.771
sid[T.92] -0.6683 0.192 -3.473 0.001 -1.046 -0.291
sid[T.94] -0.5126 0.165 -3.098 0.002 -0.837 -0.188
sid[T.95] -1.1921 0.198 -6.023 0.000 -1.580 -0.804
sid[T.99] -1.2694 0.228 -5.575 0.000 -1.716 -0.823
sid[T.101] -1.0588 0.206 -5.150 0.000 -1.462 -0.656
sid[T.102] -1.1561 0.209 -5.526 0.000 -1.566 -0.746
sid[T.103] -16.1082 344.152 -0.047 0.963 -690.634 658.417
sid[T.104] -0.9619 0.285 -3.379 0.001 -1.520 -0.404
sid[T.106] 0.4203 0.273 1.538 0.124 -0.115 0.956
actv_grp[T.Blank] 0.0799 0.077 1.038 0.299 -0.071 0.231
actv_grp[T.Deeds] 0.0637 0.077 0.826 0.409 -0.087 0.215
actv_grp[T.Diagram] -0.0284 0.076 -0.373 0.709 -0.178 0.121
actv_grp[T.FSM] -0.5538 0.240 -2.311 0.021 -1.023 -0.084
actv_grp[T.FSM_Related] -0.4768 0.186 -2.558 0.011 -0.842 -0.112
actv_grp[T.Other] 0.1131 0.079 1.430 0.153 -0.042 0.268
actv_grp[T.Properties] -0.0046 0.075 -0.062 0.950 -0.151 0.142
actv_grp[T.Study] 0.0824 0.078 1.052 0.293 -0.071 0.236
actv_grp[T.Study_Materials] -0.0188 0.181 -0.104 0.917 -0.374 0.336
actv_grp[T.TextEditor] 0.0440 0.077 0.571 0.568 -0.107 0.195
total_ms_tp000_sqrt 0.0630 0.079 0.794 0.427 -0.092 0.218
mw_tp000_sqrt 0.0172 0.034 0.503 0.615 -0.050 0.084
mwc_tp000_sqrt -0.0254 0.028 -0.903 0.367 -0.081 0.030
mcl_tp000_sqrt -0.0184 0.117 -0.158 0.875 -0.247 0.210
mcr_tp000_sqrt -0.0937 0.037 -2.557 0.011 -0.165 -0.022
mm_tp000_sqrt 0.1445 0.095 1.529 0.126 -0.041 0.330
ks_tp000_sqrt -0.0766 0.045 -1.707 0.088 -0.165 0.011
total_ms_tp010_sqrt 0.1653 0.098 1.678 0.093 -0.028 0.358
mw_tp010_sqrt 0.0178 0.058 0.309 0.757 -0.095 0.131
mwc_tp010_sqrt 0.0184 0.047 0.391 0.696 -0.074 0.111
mcl_tp010_sqrt 0.0964 0.145 0.666 0.506 -0.188 0.380
mcr_tp010_sqrt -0.0153 0.047 -0.329 0.742 -0.107 0.076
mm_tp010_sqrt -0.3221 0.155 -2.083 0.037 -0.625 -0.019
ks_tp010_sqrt 0.0438 0.061 0.712 0.477 -0.077 0.164
total_ms_tp020_sqrt 0.0586 0.122 0.481 0.630 -0.180 0.297
mw_tp020_sqrt -0.1429 0.090 -1.595 0.111 -0.319 0.033
mwc_tp020_sqrt 0.0407 0.081 0.504 0.614 -0.117 0.199
mcl_tp020_sqrt -0.3378 0.200 -1.692 0.091 -0.729 0.053
mcr_tp020_sqrt 0.0323 0.065 0.501 0.617 -0.094 0.159
mm_tp020_sqrt 0.4155 0.240 1.730 0.084 -0.055 0.886
ks_tp020_sqrt -0.0630 0.084 -0.754 0.451 -0.227 0.101
total_ms_tp030_sqrt 0.1029 0.156 0.661 0.509 -0.202 0.408
mw_tp030_sqrt 0.2046 0.112 1.832 0.067 -0.014 0.424
mwc_tp030_sqrt -0.0238 0.093 -0.256 0.798 -0.206 0.158
mcl_tp030_sqrt 0.2395 0.262 0.915 0.360 -0.274 0.752
mcr_tp030_sqrt 0.3711 0.094 3.968 0.000 0.188 0.554
mm_tp030_sqrt -0.5317 0.306 -1.736 0.083 -1.132 0.069
ks_tp030_sqrt -0.1677 0.107 -1.567 0.117 -0.377 0.042
total_ms_tp040_sqrt 0.3439 0.190 1.809 0.070 -0.029 0.717
mw_tp040_sqrt -0.2189 0.125 -1.749 0.080 -0.464 0.026
mwc_tp040_sqrt 0.0504 0.094 0.537 0.591 -0.133 0.234
mcl_tp040_sqrt -0.2796 0.306 -0.913 0.361 -0.880 0.320
mcr_tp040_sqrt 0.1229 0.116 1.057 0.291 -0.105 0.351
mm_tp040_sqrt -0.0419 0.375 -0.112 0.911 -0.776 0.692
ks_tp040_sqrt -0.1506 0.127 -1.184 0.236 -0.400 0.099
total_ms_tp050_sqrt 0.2209 0.211 1.045 0.296 -0.193 0.635
mw_tp050_sqrt 0.6338 0.190 3.329 0.001 0.261 1.007
mwc_tp050_sqrt -0.1184 0.135 -0.879 0.379 -0.382 0.146
mcl_tp050_sqrt -0.1198 0.347 -0.346 0.730 -0.800 0.560
mcr_tp050_sqrt 0.0621 0.129 0.482 0.630 -0.190 0.314
mm_tp050_sqrt -0.3126 0.426 -0.734 0.463 -1.147 0.522
ks_tp050_sqrt 0.0126 0.133 0.095 0.924 -0.248 0.273
total_ms_tp060_sqrt -0.2903 0.239 -1.213 0.225 -0.759 0.179
mw_tp060_sqrt -0.4712 0.234 -2.011 0.044 -0.930 -0.012
mwc_tp060_sqrt 0.1481 0.217 0.681 0.496 -0.278 0.574
mcl_tp060_sqrt -0.5506 0.389 -1.414 0.157 -1.314 0.213
mcr_tp060_sqrt -0.0315 0.160 -0.196 0.844 -0.346 0.283
mm_tp060_sqrt 0.9775 0.492 1.987 0.047 0.013 1.942
ks_tp060_sqrt 0.2170 0.156 1.390 0.165 -0.089 0.523
total_ms_tp070_sqrt 0.2821 0.255 1.107 0.268 -0.217 0.782
mw_tp070_sqrt 0.2210 0.232 0.954 0.340 -0.233 0.675
mwc_tp070_sqrt 0.1204 0.258 0.467 0.640 -0.385 0.625
mcl_tp070_sqrt 0.7183 0.420 1.709 0.087 -0.105 1.542
mcr_tp070_sqrt -0.5062 0.178 -2.852 0.004 -0.854 -0.158
mm_tp070_sqrt -0.9527 0.541 -1.760 0.078 -2.014 0.108
ks_tp070_sqrt -0.2045 0.181 -1.129 0.259 -0.559 0.151
total_ms_tp080_sqrt 0.3415 0.267 1.278 0.201 -0.182 0.865
mw_tp080_sqrt 0.2295 0.292 0.786 0.432 -0.343 0.802
mwc_tp080_sqrt -0.0725 0.341 -0.213 0.832 -0.740 0.595
mcl_tp080_sqrt -0.3767 0.484 -0.778 0.436 -1.325 0.572
mcr_tp080_sqrt -0.1065 0.214 -0.498 0.619 -0.526 0.313
mm_tp080_sqrt 0.0419 0.635 0.066 0.947 -1.203 1.286
ks_tp080_sqrt 0.1794 0.201 0.893 0.372 -0.214 0.573
total_ms_tp090_sqrt -0.1998 0.279 -0.715 0.475 -0.747 0.348
mw_tp090_sqrt -0.3483 0.324 -1.076 0.282 -0.983 0.286
mwc_tp090_sqrt -0.2544 0.330 -0.771 0.441 -0.901 0.392
mcl_tp090_sqrt 0.5542 0.493 1.123 0.261 -0.413 1.521
mcr_tp090_sqrt -0.1256 0.252 -0.498 0.618 -0.619 0.368
mm_tp090_sqrt -0.2737 0.669 -0.409 0.683 -1.586 1.038
ks_tp090_sqrt 0.2912 0.206 1.411 0.158 -0.113 0.696
total_ms_tp100_sqrt -0.6287 0.179 -3.517 0.000 -0.979 -0.278
mw_tp100_sqrt 0.0136 0.208 0.065 0.948 -0.395 0.422
mwc_tp100_sqrt 0.1678 0.225 0.746 0.456 -0.273 0.609
mcl_tp100_sqrt -0.0073 0.335 -0.022 0.983 -0.665 0.650
mcr_tp100_sqrt 0.5504 0.183 3.011 0.003 0.192 0.909
mm_tp100_sqrt 0.5782 0.461 1.254 0.210 -0.326 1.482
ks_tp100_sqrt 0.0676 0.128 0.526 0.599 -0.184 0.319
===============================================================================================
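Poisson regression uses a log link, so a coefficient is easier to read after exponentiating: exp(coef) is the multiplicative change in the expected count for a one-unit increase in the feature, or relative to the baseline level for a categorical term. The sketch below copies a few rows from the summary; the selection of terms is arbitrary and only illustrative.

```python
from math import exp

# (coef, lower 95% bound, upper 95% bound) copied from the summary above
terms = {
    'actv_grp[T.FSM]': (-0.5538, -1.023, -0.084),
    'mcr_tp030_sqrt': (0.3711, 0.188, 0.554),
    'total_ms_tp100_sqrt': (-0.6287, -0.979, -0.278),
}

# Exponentiate the estimate and both interval bounds to get rate ratios
rate_ratios = {name: tuple(exp(v) for v in vals) for name, vals in terms.items()}

for name, (rr, lo, hi) in rate_ratios.items():
    print(f"{name}: rate ratio {rr:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

A rate ratio below 1 means a lower expected number of correct answers, holding the other terms fixed. If the numeric features were standardized, as the name final_sqrt_std_df suggests, one unit corresponds to one standard deviation. The sid terms with coefficients near -14 to -16 and standard errors of 344 are probably students whose final_events is zero in every row, which leaves the estimate unbounded; this is an interpretation, not something the output states.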
my_coefplot(model_list[2])
The included notebook CMPINF2120_EPM_PCA_INCL_Over_Lisa.ipynb standardizes the variables in features_df and performs PCA.
features_df = sqrt_features_df.copy()
feature_names = sqrt_features_df.columns
%run CMPINF2120_EPM_PCA_INCL_Over_Lisa.ipynb
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2444 entries, 0 to 2443
Data columns (total 77 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   PC01    2444 non-null   float64
 1   PC02    2444 non-null   float64
 2   PC03    2444 non-null   float64
 3   PC04    2444 non-null   float64
 4   PC05    2444 non-null   float64
 5   PC06    2444 non-null   float64
 6   PC07    2444 non-null   float64
 7   PC08    2444 non-null   float64
 8   PC09    2444 non-null   float64
 9   PC10    2444 non-null   float64
 10  PC11    2444 non-null   float64
 11  PC12    2444 non-null   float64
 12  PC13    2444 non-null   float64
 13  PC14    2444 non-null   float64
 14  PC15    2444 non-null   float64
 15  PC16    2444 non-null   float64
 16  PC17    2444 non-null   float64
 17  PC18    2444 non-null   float64
 18  PC19    2444 non-null   float64
 19  PC20    2444 non-null   float64
 20  PC21    2444 non-null   float64
 21  PC22    2444 non-null   float64
 22  PC23    2444 non-null   float64
 23  PC24    2444 non-null   float64
 24  PC25    2444 non-null   float64
 25  PC26    2444 non-null   float64
 26  PC27    2444 non-null   float64
 27  PC28    2444 non-null   float64
 28  PC29    2444 non-null   float64
 29  PC30    2444 non-null   float64
 30  PC31    2444 non-null   float64
 31  PC32    2444 non-null   float64
 32  PC33    2444 non-null   float64
 33  PC34    2444 non-null   float64
 34  PC35    2444 non-null   float64
 35  PC36    2444 non-null   float64
 36  PC37    2444 non-null   float64
 37  PC38    2444 non-null   float64
 38  PC39    2444 non-null   float64
 39  PC40    2444 non-null   float64
 40  PC41    2444 non-null   float64
 41  PC42    2444 non-null   float64
 42  PC43    2444 non-null   float64
 43  PC44    2444 non-null   float64
 44  PC45    2444 non-null   float64
 45  PC46    2444 non-null   float64
 46  PC47    2444 non-null   float64
 47  PC48    2444 non-null   float64
 48  PC49    2444 non-null   float64
 49  PC50    2444 non-null   float64
 50  PC51    2444 non-null   float64
 51  PC52    2444 non-null   float64
 52  PC53    2444 non-null   float64
 53  PC54    2444 non-null   float64
 54  PC55    2444 non-null   float64
 55  PC56    2444 non-null   float64
 56  PC57    2444 non-null   float64
 57  PC58    2444 non-null   float64
 58  PC59    2444 non-null   float64
 59  PC60    2444 non-null   float64
 60  PC61    2444 non-null   float64
 61  PC62    2444 non-null   float64
 62  PC63    2444 non-null   float64
 63  PC64    2444 non-null   float64
 64  PC65    2444 non-null   float64
 65  PC66    2444 non-null   float64
 66  PC67    2444 non-null   float64
 67  PC68    2444 non-null   float64
 68  PC69    2444 non-null   float64
 69  PC70    2444 non-null   float64
 70  PC71    2444 non-null   float64
 71  PC72    2444 non-null   float64
 72  PC73    2444 non-null   float64
 73  PC74    2444 non-null   float64
 74  PC75    2444 non-null   float64
 75  PC76    2444 non-null   float64
 76  PC77    2444 non-null   float64
dtypes: float64(77)
memory usage: 1.4 MB
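The PCA notebook run above is not shown in this section. As a rough illustration of a standardize-then-PCA step, the sketch below assumes the included notebook does something equivalent to StandardScaler followed by PCA (both are imported at the top of this notebook). It uses NumPy's SVD on synthetic data instead of scikit-learn so that it is self-contained, and the function name `standardize_and_pca` is hypothetical.

```python
import numpy as np


def standardize_and_pca(X):
    """Standardize columns, then return PC scores, loadings, and variance ratios."""
    X = np.asarray(X, dtype=float)
    # Center and scale with the population standard deviation (ddof=0)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized data: rows of Vt are the principal directions
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                        # coordinates of each row on each PC
    explained_ratio = s**2 / np.sum(s**2)
    return scores, Vt, explained_ratio


# Synthetic example: 200 rows, 5 features, two of them strongly correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)

scores, components, ratio = standardize_and_pca(X)
print(scores.shape)
print(np.round(ratio, 3))
```

The scores have one column per original feature, which matches the 77 PCs reported for the 77 square-root features, and the columns are mutually uncorrelated by construction.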
first_pc_scores_df = pc_scores_df.copy()
input_sqrt_df and PCs from pc_scores_df
first_pc_scores_df.head()
| | PC01 | PC02 | PC03 | PC04 | PC05 | PC06 | PC07 | PC08 | PC09 | PC10 | ... | PC68 | PC69 | PC70 | PC71 | PC72 | PC73 | PC74 | PC75 | PC76 | PC77 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.343494 | 0.614381 | 3.048655 | -2.980032 | -0.717602 | -2.483105 | -0.946123 | -0.109224 | 0.802887 | 0.573025 | ... | 0.021226 | 0.007764 | -0.042438 | 0.007726 | 0.016882 | -0.033860 | -0.020212 | -0.020811 | 0.002340 | 0.013947 |
| 1 | 2.264423 | -0.252348 | 3.635242 | -1.695365 | -0.567017 | -2.818182 | -1.183510 | -1.495354 | 0.493679 | 1.094633 | ... | -0.054488 | 0.016730 | 0.010136 | -0.022828 | -0.049764 | -0.025509 | -0.011965 | -0.003058 | 0.010145 | -0.003176 |
| 2 | 2.407197 | -0.285384 | 3.514516 | -1.835526 | -0.700168 | -2.871521 | -1.200218 | -1.409066 | 0.499980 | 0.042987 | ... | -0.008969 | -0.002833 | -0.015106 | 0.000015 | -0.020132 | -0.009533 | -0.014708 | 0.001780 | 0.028581 | -0.000724 |
| 3 | 1.800267 | -0.177009 | 3.836746 | -0.226538 | 0.360890 | -3.378267 | -1.494108 | 1.350565 | 1.428391 | -0.292023 | ... | -0.052102 | 0.019233 | -0.019455 | -0.021534 | -0.068026 | -0.055084 | 0.009613 | 0.001835 | -0.002256 | -0.030537 |
| 4 | 2.285621 | -0.236315 | 3.690869 | -1.931407 | -0.768909 | -2.861841 | -1.110547 | -1.684833 | 0.437358 | 0.939661 | ... | -0.031325 | -0.003566 | 0.041341 | -0.008955 | -0.027152 | -0.000594 | -0.012622 | 0.027929 | 0.006442 | -0.031819 |
5 rows × 77 columns
final_sqrt_df.head()
| | sess | sid | actv_grp | total_ms_tp000_sqrt | mw_tp000_sqrt | mwc_tp000_sqrt | mcl_tp000_sqrt | mcr_tp000_sqrt | mm_tp000_sqrt | ks_tp000_sqrt | ... | ks_tp090_sqrt | total_ms_tp100_sqrt | mw_tp100_sqrt | mwc_tp100_sqrt | mcl_tp100_sqrt | mcr_tp100_sqrt | mm_tp100_sqrt | ks_tp100_sqrt | final_events | final_trials |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Aulaweb | 89.442719 | 0.000000 | 0.0 | 2.000000 | 0.000000 | 21.931712 | 0.000000 | ... | 31.572140 | 2563.981279 | 21.213203 | 0.0 | 65.696271 | 17.146428 | 570.063154 | 34.409301 | 2.0 | 2.0 |
| 1 | 1 | 1 | Blank | 89.442719 | 0.000000 | 0.0 | 2.000000 | 0.000000 | 23.237900 | 0.000000 | ... | 31.629101 | 2562.420730 | 21.166010 | 0.0 | 65.635356 | 17.146428 | 569.584937 | 34.409301 | 2.0 | 2.0 |
| 2 | 1 | 1 | Deeds | 202.484567 | 2.449490 | 0.0 | 3.464102 | 0.000000 | 46.054316 | 2.000000 | ... | 31.559468 | 2492.789602 | 20.688161 | 0.0 | 63.812225 | 16.852300 | 554.692708 | 33.346664 | 2.0 | 2.0 |
| 3 | 1 | 1 | Diagram | 939.148551 | 5.567764 | 0.0 | 21.447611 | 7.348469 | 183.891272 | 4.582576 | ... | 31.272992 | 2546.566316 | 20.688161 | 0.0 | 65.038450 | 17.088007 | 564.874322 | 33.346664 | 2.0 | 2.0 |
| 4 | 1 | 1 | Other | 31.622777 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 9.165151 | 0.000000 | ... | 31.811947 | 2562.420730 | 21.166010 | 0.0 | 65.620119 | 17.146428 | 569.535776 | 34.409301 | 2.0 | 2.0 |
5 rows × 82 columns
final_sqrt_df.loc[:, ['sess','sid','actv_grp','final_events','final_trials']]
| | sess | sid | actv_grp | final_events | final_trials |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Aulaweb | 2.0 | 2.0 |
| 1 | 1 | 1 | Blank | 2.0 | 2.0 |
| 2 | 1 | 1 | Deeds | 2.0 | 2.0 |
| 3 | 1 | 1 | Diagram | 2.0 | 2.0 |
| 4 | 1 | 1 | Other | 2.0 | 2.0 |
| ... | ... | ... | ... | ... | ... |
| 2439 | 6 | 102 | FSM_Related | 0.0 | 2.0 |
| 2440 | 6 | 102 | Other | 0.0 | 2.0 |
| 2441 | 6 | 102 | Properties | 0.0 | 2.0 |
| 2442 | 6 | 102 | Study | 0.0 | 2.0 |
| 2443 | 6 | 102 | TextEditor | 0.0 | 2.0 |
2444 rows × 5 columns
pd.concat([final_sqrt_df.loc[:,['sess','sid','actv_grp','final_events','final_trials']].copy(), first_pc_scores_df], axis=1)
| | sess | sid | actv_grp | final_events | final_trials | PC01 | PC02 | PC03 | PC04 | PC05 | ... | PC68 | PC69 | PC70 | PC71 | PC72 | PC73 | PC74 | PC75 | PC76 | PC77 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Aulaweb | 2.0 | 2.0 | -1.343494 | 0.614381 | 3.048655 | -2.980032 | -0.717602 | ... | 0.021226 | 0.007764 | -0.042438 | 0.007726 | 0.016882 | -0.033860 | -0.020212 | -0.020811 | 0.002340 | 0.013947 |
| 1 | 1 | 1 | Blank | 2.0 | 2.0 | 2.264423 | -0.252348 | 3.635242 | -1.695365 | -0.567017 | ... | -0.054488 | 0.016730 | 0.010136 | -0.022828 | -0.049764 | -0.025509 | -0.011965 | -0.003058 | 0.010145 | -0.003176 |
| 2 | 1 | 1 | Deeds | 2.0 | 2.0 | 2.407197 | -0.285384 | 3.514516 | -1.835526 | -0.700168 | ... | -0.008969 | -0.002833 | -0.015106 | 0.000015 | -0.020132 | -0.009533 | -0.014708 | 0.001780 | 0.028581 | -0.000724 |
| 3 | 1 | 1 | Diagram | 2.0 | 2.0 | 1.800267 | -0.177009 | 3.836746 | -0.226538 | 0.360890 | ... | -0.052102 | 0.019233 | -0.019455 | -0.021534 | -0.068026 | -0.055084 | 0.009613 | 0.001835 | -0.002256 | -0.030537 |
| 4 | 1 | 1 | Other | 2.0 | 2.0 | 2.285621 | -0.236315 | 3.690869 | -1.931407 | -0.768909 | ... | -0.031325 | -0.003566 | 0.041341 | -0.008955 | -0.027152 | -0.000594 | -0.012622 | 0.027929 | 0.006442 | -0.031819 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2439 | 6 | 102 | FSM_Related | 0.0 | 2.0 | 1.539805 | -0.209572 | 5.056273 | -0.625754 | 1.067768 | ... | -0.051070 | 0.002365 | -0.015540 | 0.000878 | -0.008006 | 0.012859 | -0.005245 | -0.003537 | -0.004329 | 0.007108 |
| 2440 | 6 | 102 | Other | 0.0 | 2.0 | -3.926028 | 1.379748 | 3.355224 | -4.322836 | 0.003769 | ... | -0.022156 | 0.008253 | -0.027522 | -0.007256 | 0.063841 | -0.014692 | 0.001131 | -0.003470 | 0.003893 | -0.011124 |
| 2441 | 6 | 102 | Properties | 0.0 | 2.0 | 1.810717 | -0.137528 | 5.400015 | -0.807313 | 1.201469 | ... | -0.004866 | -0.000190 | -0.007158 | -0.008692 | -0.018735 | -0.015721 | -0.015870 | 0.008795 | -0.032018 | 0.004263 |
| 2442 | 6 | 102 | Study | 0.0 | 2.0 | 0.895392 | 0.111307 | 4.748045 | -2.764785 | -0.059818 | ... | 0.002408 | -0.067429 | 0.027239 | 0.034184 | 0.016383 | -0.008254 | 0.006501 | -0.013523 | -0.007994 | 0.010186 |
| 2443 | 6 | 102 | TextEditor | 0.0 | 2.0 | -0.263473 | 0.345440 | 4.523714 | -1.698905 | 0.928430 | ... | 0.061311 | -0.020104 | -0.030066 | 0.015619 | 0.085143 | -0.015343 | -0.024832 | 0.008714 | 0.012625 | 0.005287 |
2444 rows × 82 columns
pc_df_to_model = pd.concat([final_sqrt_df.loc[:,['sess','sid','actv_grp','final_events','final_trials']].copy(), pc_scores_df.copy()], axis=1)
pc_features = ['PC01','PC02','PC03','PC04','PC05','PC06','PC07','PC08']
# Join the PC names into the right-hand side of a formula
pc_features_str = ' + '.join(pc_features)
pc_features_str
'PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08'
pc_y, pc_X = dmatrices('final_events ~ (' + pc_features_str + ')**2', data=pc_df_to_model, return_type='dataframe')
pc_X.head()
| | Intercept | PC01 | PC02 | PC03 | PC04 | PC05 | PC06 | PC07 | PC08 | PC01:PC02 | ... | PC04:PC05 | PC04:PC06 | PC04:PC07 | PC04:PC08 | PC05:PC06 | PC05:PC07 | PC05:PC08 | PC06:PC07 | PC06:PC08 | PC07:PC08 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | -1.343494 | 0.614381 | 3.048655 | -2.980032 | -0.717602 | -2.483105 | -0.946123 | -0.109224 | -0.825417 | ... | 2.138478 | 7.399732 | 2.819477 | 0.325492 | 1.781882 | 0.678940 | 0.078380 | 2.349322 | 0.271215 | 0.103340 |
| 1 | 1.0 | 2.264423 | -0.252348 | 3.635242 | -1.695365 | -0.567017 | -2.818182 | -1.183510 | -1.495354 | -0.571422 | ... | 0.961300 | 4.777845 | 2.006481 | 2.535169 | 1.597957 | 0.671070 | 0.847891 | 3.335346 | 4.214178 | 1.769766 |
| 2 | 1.0 | 2.407197 | -0.285384 | 3.514516 | -1.835526 | -0.700168 | -2.871521 | -1.200218 | -1.409066 | -0.686975 | ... | 1.285177 | 5.270754 | 2.203032 | 2.586378 | 2.010548 | 0.840355 | 0.986583 | 3.446452 | 4.046164 | 1.691187 |
| 3 | 1.0 | 1.800267 | -0.177009 | 3.836746 | -0.226538 | 0.360890 | -3.378267 | -1.494108 | 1.350565 | -0.318664 | ... | -0.081755 | 0.765305 | 0.338472 | -0.305954 | -1.219183 | -0.539209 | 0.487405 | 5.047496 | -4.562567 | -2.017889 |
| 4 | 1.0 | 2.285621 | -0.236315 | 3.690869 | -1.931407 | -0.768909 | -2.861841 | -1.110547 | -1.684833 | -0.540126 | ... | 1.485077 | 5.527380 | 2.144918 | 3.254098 | 2.200496 | 0.853909 | 1.295483 | 3.178208 | 4.821723 | 1.871085 |
5 rows × 37 columns
pc_X.columns
Index(['Intercept', 'PC01', 'PC02', 'PC03', 'PC04', 'PC05', 'PC06', 'PC07',
'PC08', 'PC01:PC02', 'PC01:PC03', 'PC01:PC04', 'PC01:PC05', 'PC01:PC06',
'PC01:PC07', 'PC01:PC08', 'PC02:PC03', 'PC02:PC04', 'PC02:PC05',
'PC02:PC06', 'PC02:PC07', 'PC02:PC08', 'PC03:PC04', 'PC03:PC05',
'PC03:PC06', 'PC03:PC07', 'PC03:PC08', 'PC04:PC05', 'PC04:PC06',
'PC04:PC07', 'PC04:PC08', 'PC05:PC06', 'PC05:PC07', 'PC05:PC08',
'PC06:PC07', 'PC06:PC08', 'PC07:PC08'],
dtype='object')
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(data = pc_X.drop(columns=['Intercept']).corr(),
vmin=-1, vmax=1, center = 0,
cmap='coolwarm',
ax=ax)
plt.show()
features_df = pc_X.drop(columns=['Intercept']).copy()
feature_names = features_df.columns
%run CMPINF2120_EPM_PCA_INCL_Over_Lisa.ipynb
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2444 entries, 0 to 2443
Data columns (total 36 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   PC01    2444 non-null   float64
 1   PC02    2444 non-null   float64
 2   PC03    2444 non-null   float64
 3   PC04    2444 non-null   float64
 4   PC05    2444 non-null   float64
 5   PC06    2444 non-null   float64
 6   PC07    2444 non-null   float64
 7   PC08    2444 non-null   float64
 8   PC09    2444 non-null   float64
 9   PC10    2444 non-null   float64
 10  PC11    2444 non-null   float64
 11  PC12    2444 non-null   float64
 12  PC13    2444 non-null   float64
 13  PC14    2444 non-null   float64
 14  PC15    2444 non-null   float64
 15  PC16    2444 non-null   float64
 16  PC17    2444 non-null   float64
 17  PC18    2444 non-null   float64
 18  PC19    2444 non-null   float64
 19  PC20    2444 non-null   float64
 20  PC21    2444 non-null   float64
 21  PC22    2444 non-null   float64
 22  PC23    2444 non-null   float64
 23  PC24    2444 non-null   float64
 24  PC25    2444 non-null   float64
 25  PC26    2444 non-null   float64
 26  PC27    2444 non-null   float64
 27  PC28    2444 non-null   float64
 28  PC29    2444 non-null   float64
 29  PC30    2444 non-null   float64
 30  PC31    2444 non-null   float64
 31  PC32    2444 non-null   float64
 32  PC33    2444 non-null   float64
 33  PC34    2444 non-null   float64
 34  PC35    2444 non-null   float64
 35  PC36    2444 non-null   float64
dtypes: float64(36)
memory usage: 687.5 KB
pc_scores_df.shape
(2444, 36)
pc_descr_formulas = ['final_events ~ sid'
,'final_events ~ sid + actv_grp'
,'final_events ~ sid + actv_grp + ' + pc_features_str
,'final_events ~ sid * (' + pc_features_str + ')'
,'final_events ~ sid * (actv_grp + ' + pc_features_str + ')'
]
pc_pred_formulas = ['final_events ~ ' + pc_features_str
,'final_events ~ (' + pc_features_str + ')**2'
,'final_events ~ actv_grp + ' + pc_features_str
,'final_events ~ actv_grp * (' + pc_features_str + ')'
,'final_events ~ actv_grp + (' + pc_features_str + ')**2'
,'final_events ~ actv_grp * (' + pc_features_str + ')**2'
]
pc_formula_test_list = pc_descr_formulas + pc_pred_formulas
pc_formula_test_list
['final_events ~ sid', 'final_events ~ sid + actv_grp', 'final_events ~ sid + actv_grp + PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08', 'final_events ~ sid * (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)', 'final_events ~ sid * (actv_grp + PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)', 'final_events ~ PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08', 'final_events ~ (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)**2', 'final_events ~ actv_grp + PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08', 'final_events ~ actv_grp * (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)', 'final_events ~ actv_grp + (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)**2', 'final_events ~ actv_grp * (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)**2']
##### Evaluate the number of features with dmatrices
pc_sk_list = make_dmat(pc_df_to_model, pc_formula_test_list)
pc_model_dim = make_dim_df(pc_df_to_model, pc_sk_list, pc_formula_test_list)
pc_model_dim
| | model name | dimensions | number of obs | dim < obs |
|---|---|---|---|---|
| 0 | 0 | 62 | 2444 | Yes |
| 1 | 1 | 72 | 2444 | Yes |
| 2 | 2 | 80 | 2444 | Yes |
| 3 | 3 | 558 | 2444 | Yes |
| 4 | 4 | 1178 | 2444 | Yes |
| 5 | 5 | 9 | 2444 | Yes |
| 6 | 6 | 37 | 2444 | Yes |
| 7 | 7 | 19 | 2444 | Yes |
| 8 | 8 | 99 | 2444 | Yes |
| 9 | 9 | 47 | 2444 | Yes |
| 10 | 10 | 407 | 2444 | Yes |
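The helpers make_dmat and make_dim_df come from the included functions notebook and are not shown here. The sketch below is one way such a dimension check could be written with patsy's dmatrices, run on toy data. The function name `design_dims` and the toy column names are hypothetical, and this is not necessarily how the notebook's helpers are implemented.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices


def design_dims(data, formulas):
    """Return one row per formula with its design-matrix column count."""
    rows = []
    for i, formula in enumerate(formulas):
        # Build the design matrix only; no model is fit
        _, X = dmatrices(formula, data=data, return_type='dataframe')
        rows.append({'model name': i,
                     'dimensions': X.shape[1],
                     'number of obs': X.shape[0],
                     'dim < obs': 'Yes' if X.shape[1] < X.shape[0] else 'No'})
    return pd.DataFrame(rows)


# Toy data: one 3-level categorical, three numeric columns, 30 rows
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'y': rng.poisson(2.0, size=30),
    'g': pd.Categorical(['a', 'b', 'c'] * 10),
    'x1': rng.normal(size=30),
    'x2': rng.normal(size=30),
    'x3': rng.normal(size=30),
})

toy_dims = design_dims(toy, ['y ~ g + x1 + x2 + x3',
                             'y ~ (x1 + x2 + x3)**2',
                             'y ~ g * (x1 + x2 + x3)**2'])
print(toy_dims)
```

The same pattern explains the table above: with 8 PCs, the pairwise formula gives 1 + 8 + 28 = 37 columns, and crossing it with the 11 activity groups gives 11 × 37 = 407.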
pc_adjust_desc_formulas = ['final_events ~ sid'
,'final_events ~ sid + actv_grp'
,'final_events ~ sid + actv_grp + ' + pc_features_str
#,'final_events ~ sid * (' + pc_features_str + ')'
#,'final_events ~ sid * (actv_grp + ' + pc_features_str + ')'
]
pc_adjust_pred_formulas = ['final_events ~ ' + pc_features_str
,'final_events ~ (' + pc_features_str + ')**2'
,'final_events ~ actv_grp + ' + pc_features_str
,'final_events ~ actv_grp * (' + pc_features_str + ')'
,'final_events ~ actv_grp + (' + pc_features_str + ')**2'
#,'final_events ~ actv_grp * (' + pc_features_str + ')**2'
]
pc_formula_list = pc_adjust_desc_formulas + pc_adjust_pred_formulas
pc_formula_list
['final_events ~ sid', 'final_events ~ sid + actv_grp', 'final_events ~ sid + actv_grp + PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08', 'final_events ~ PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08', 'final_events ~ (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)**2', 'final_events ~ actv_grp + PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08', 'final_events ~ actv_grp * (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)', 'final_events ~ actv_grp + (PC01 + PC02 + PC03 + PC04 + PC05 + PC06 + PC07 + PC08)**2']
pc_model_list = []
for a_formula in pc_formula_list:
    #pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='bfgs') )
    pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='ncg') )
    #pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='lbfgs') )
    #pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='powell') )
    #pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='newton') )
    #pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='cg') )
    #pc_model_list.append( smf.poisson( formula = a_formula, data = pc_df_to_model).fit(method='basinhopping') )
Optimization terminated successfully.
Current function value: 1.543124
Iterations: 14
Function evaluations: 15
Gradient evaluations: 15
Hessian evaluations: 14
Optimization terminated successfully.
Current function value: 1.525616
Iterations: 14
Function evaluations: 15
Gradient evaluations: 15
Hessian evaluations: 14
Optimization terminated successfully.
Current function value: 1.325239
Iterations: 22
Function evaluations: 24
Gradient evaluations: 24
Hessian evaluations: 22
Optimization terminated successfully.
Current function value: 1.548077
Iterations: 8
Function evaluations: 9
Gradient evaluations: 9
Hessian evaluations: 8
Optimization terminated successfully.
Current function value: 1.505923
Iterations: 12
Function evaluations: 13
Gradient evaluations: 13
Hessian evaluations: 12
Optimization terminated successfully.
Current function value: 1.536368
Iterations: 12
Function evaluations: 13
Gradient evaluations: 13
Hessian evaluations: 12
Optimization terminated successfully.
Current function value: 1.501270
Iterations: 12
Function evaluations: 13
Gradient evaluations: 13
Hessian evaluations: 12
Optimization terminated successfully.
Current function value: 1.497400
Iterations: 15
Function evaluations: 16
Gradient evaluations: 16
Hessian evaluations: 15
pc_model_results = pd.DataFrame({'model_name': ['pc_mod00','pc_mod01','pc_mod02','pc_mod03','pc_mod04','pc_mod05','pc_mod06','pc_mod07'],
'AIC': [mod.aic for mod in pc_model_list],
'BIC': [mod.bic for mod in pc_model_list],
'Prsquared': [mod.prsquared for mod in pc_model_list]})
pc_model_results
| | model_name | AIC | BIC | Prsquared |
|---|---|---|---|---|
| 0 | pc_mod00 | 7666.789117 | 8026.475379 | 0.101106 |
| 1 | pc_mod01 | 7601.210361 | 8018.910536 | 0.111305 |
| 2 | pc_mod02 | 6637.769417 | 7101.880723 | 0.228027 |
| 3 | pc_mod03 | 7584.999588 | 7637.212110 | 0.098221 |
| 4 | pc_mod04 | 7434.950334 | 7649.601813 | 0.122776 |
| 5 | pc_mod05 | 7547.767911 | 7657.994346 | 0.105041 |
| 6 | pc_mod06 | 7536.208790 | 8110.546531 | 0.125487 |
| 7 | pc_mod07 | 7413.290030 | 7685.955422 | 0.127741 |
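
As a sanity check, the AIC and BIC of pc_mod02 can be recovered from the optimizer log. The sketch below assumes that the "Current function value" printed for the third fit (1.325239) is the negative log-likelihood divided by the number of observations, and that the model has 80 parameters, as listed in the dimension table. The function name is hypothetical.

```python
import math

def aic_bic_from_mean_nll(mean_nll, n_obs, n_params):
    """Recover log-likelihood, AIC, and BIC from a mean negative log-likelihood."""
    llf = -mean_nll * n_obs                        # total log-likelihood
    aic = 2 * n_params - 2 * llf                   # AIC = 2k - 2*LL
    bic = n_params * math.log(n_obs) - 2 * llf     # BIC = k*ln(n) - 2*LL
    return llf, aic, bic

# Values read from the notebook output for pc_mod02.
llf, aic, bic = aic_bic_from_mean_nll(mean_nll=1.325239, n_obs=2444, n_params=80)
print(f"log-likelihood = {llf:.1f}, AIC = {aic:.2f}, BIC = {bic:.2f}")
```

Under those assumptions the results agree with the table row for pc_mod02 up to the rounding of the printed function value. BIC charges about 7.8 per parameter at n = 2444 versus 2 for AIC, which is why the 62-column `sid` models score comparatively worse on BIC than on AIC.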
sns.relplot(data = pc_model_results.melt(id_vars=['model_name']),
x='model_name',
y='value',
col='variable',
col_wrap=2,
facet_kws = {'sharey': False},
height=5, aspect=2)
plt.show()
print(pc_model_list[2].summary())
Poisson Regression Results
==============================================================================
Dep. Variable: final_events No. Observations: 2444
Model: Poisson Df Residuals: 2364
Method: MLE Df Model: 79
Date: Thu, 27 Apr 2023 Pseudo R-squ.: 0.2280
Time: 07:52:57 Log-Likelihood: -3238.9
converged: True LL-Null: -4195.6
Covariance Type: nonrobust LLR p-value: 0.000
===============================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 0.8569 0.125 6.841 0.000 0.611 1.102
sid[T.2] -0.6452 0.171 -3.766 0.000 -0.981 -0.309
sid[T.4] -1.4606 0.199 -7.326 0.000 -1.851 -1.070
sid[T.5] -0.1323 0.163 -0.813 0.416 -0.451 0.187
sid[T.7] -0.4821 0.157 -3.071 0.002 -0.790 -0.174
sid[T.8] -1.5239 0.236 -6.462 0.000 -1.986 -1.062
sid[T.9] -0.8776 0.179 -4.896 0.000 -1.229 -0.526
sid[T.11] -0.2181 0.161 -1.356 0.175 -0.533 0.097
sid[T.12] -0.8716 0.183 -4.765 0.000 -1.230 -0.513
sid[T.14] -0.0160 0.154 -0.104 0.917 -0.318 0.286
sid[T.19] -0.4075 0.182 -2.245 0.025 -0.763 -0.052
sid[T.20] 0.3226 0.145 2.221 0.026 0.038 0.607
sid[T.22] -1.7517 0.278 -6.308 0.000 -2.296 -1.207
sid[T.24] -0.6711 0.170 -3.952 0.000 -1.004 -0.338
sid[T.25] -0.7196 0.187 -3.844 0.000 -1.086 -0.353
sid[T.30] -0.1450 0.165 -0.880 0.379 -0.468 0.178
sid[T.33] -12.0511 119.257 -0.101 0.920 -245.791 221.689
sid[T.34] -1.0135 0.178 -5.681 0.000 -1.363 -0.664
sid[T.37] -0.9214 0.251 -3.671 0.000 -1.413 -0.430
sid[T.38] -0.7803 0.171 -4.565 0.000 -1.115 -0.445
sid[T.39] -0.6693 0.170 -3.947 0.000 -1.002 -0.337
sid[T.42] -1.4754 0.192 -7.667 0.000 -1.853 -1.098
sid[T.44] 1.2138 0.161 7.533 0.000 0.898 1.530
sid[T.45] 0.2702 0.182 1.484 0.138 -0.087 0.627
sid[T.46] -1.0982 0.277 -3.962 0.000 -1.641 -0.555
sid[T.47] -1.1412 0.195 -5.853 0.000 -1.523 -0.759
sid[T.49] -1.1224 0.183 -6.138 0.000 -1.481 -0.764
sid[T.51] -1.4247 0.212 -6.708 0.000 -1.841 -1.008
sid[T.52] -1.4138 0.205 -6.898 0.000 -1.815 -1.012
sid[T.54] -0.9313 0.177 -5.261 0.000 -1.278 -0.584
sid[T.55] 0.0668 0.170 0.394 0.694 -0.266 0.399
sid[T.56] -0.1048 0.154 -0.680 0.496 -0.407 0.197
sid[T.57] -12.7017 119.257 -0.107 0.915 -246.442 221.039
sid[T.58] -0.6118 0.427 -1.434 0.152 -1.448 0.224
sid[T.59] -1.3104 0.212 -6.174 0.000 -1.726 -0.894
sid[T.60] -13.4938 119.257 -0.113 0.910 -247.234 220.247
sid[T.61] -0.5787 0.178 -3.246 0.001 -0.928 -0.229
sid[T.62] -0.6456 0.276 -2.335 0.020 -1.187 -0.104
sid[T.64] -13.6254 119.257 -0.114 0.909 -247.366 220.115
sid[T.67] -0.1626 0.180 -0.903 0.366 -0.515 0.190
sid[T.68] 0.0336 0.151 0.222 0.824 -0.263 0.330
sid[T.69] -0.7821 0.214 -3.649 0.000 -1.202 -0.362
sid[T.70] -0.7380 0.176 -4.188 0.000 -1.083 -0.393
sid[T.71] -0.4077 0.197 -2.072 0.038 -0.793 -0.022
sid[T.73] -0.8393 0.194 -4.335 0.000 -1.219 -0.460
sid[T.75] 0.0326 0.161 0.203 0.839 -0.283 0.348
sid[T.77] -0.6609 0.426 -1.552 0.121 -1.495 0.174
sid[T.79] -0.8551 0.165 -5.192 0.000 -1.178 -0.532
sid[T.80] -0.9316 0.180 -5.178 0.000 -1.284 -0.579
sid[T.82] -1.6979 0.212 -7.995 0.000 -2.114 -1.282
sid[T.83] -1.5917 0.203 -7.853 0.000 -1.989 -1.194
sid[T.87] -0.5131 0.160 -3.210 0.001 -0.826 -0.200
sid[T.91] -1.1428 0.182 -6.264 0.000 -1.500 -0.785
sid[T.92] -0.5063 0.181 -2.804 0.005 -0.860 -0.152
sid[T.94] -0.4604 0.155 -2.962 0.003 -0.765 -0.156
sid[T.95] -1.0420 0.185 -5.623 0.000 -1.405 -0.679
sid[T.99] -1.2638 0.212 -5.957 0.000 -1.680 -0.848
sid[T.101] -0.9256 0.197 -4.697 0.000 -1.312 -0.539
sid[T.102] -1.2173 0.200 -6.097 0.000 -1.609 -0.826
sid[T.103] -13.7455 119.257 -0.115 0.908 -247.486 219.995
sid[T.104] -0.8747 0.277 -3.154 0.002 -1.418 -0.331
sid[T.106] 0.7139 0.210 3.397 0.001 0.302 1.126
actv_grp[T.Blank] 0.0110 0.074 0.149 0.882 -0.134 0.156
actv_grp[T.Deeds] 0.0232 0.072 0.321 0.748 -0.119 0.165
actv_grp[T.Diagram] -0.1319 0.073 -1.816 0.069 -0.274 0.010
actv_grp[T.FSM] -0.8886 0.234 -3.805 0.000 -1.346 -0.431
actv_grp[T.FSM_Related] -0.6528 0.182 -3.578 0.000 -1.011 -0.295
actv_grp[T.Other] 0.0238 0.076 0.313 0.755 -0.125 0.173
actv_grp[T.Properties] -0.0863 0.073 -1.189 0.235 -0.229 0.056
actv_grp[T.Study] 0.0159 0.075 0.212 0.832 -0.131 0.163
actv_grp[T.Study_Materials] -0.2142 0.172 -1.245 0.213 -0.551 0.123
actv_grp[T.TextEditor] -0.0107 0.073 -0.147 0.883 -0.153 0.132
PC01 0.0564 0.004 15.066 0.000 0.049 0.064
PC02 -0.0306 0.008 -4.048 0.000 -0.045 -0.016
PC03 -0.1173 0.008 -15.385 0.000 -0.132 -0.102
PC04 0.0250 0.008 3.073 0.002 0.009 0.041
PC05 -0.0447 0.009 -5.207 0.000 -0.062 -0.028
PC06 -0.1237 0.010 -12.826 0.000 -0.143 -0.105
PC07 0.0997 0.012 8.191 0.000 0.076 0.124
PC08 -0.0274 0.015 -1.835 0.066 -0.057 0.002
===============================================================================================
my_coefplot(pc_model_list[2])
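
Because the Poisson model uses a log link, each coefficient is easier to read after exponentiating: `exp(coef)` is the multiplicative change in the expected count for a one-unit increase in the feature (or relative to the reference level for a categorical term). The sketch below applies this to a few rows copied from the summary above; the helper name is hypothetical.

```python
import math

def rate_ratio(coef, ci_low, ci_high):
    """Exponentiate a log-link coefficient and its confidence bounds."""
    return math.exp(coef), math.exp(ci_low), math.exp(ci_high)

# (coef, lower, upper) copied from the pc_model_list[2] summary.
rows = {
    "PC01": (0.0564, 0.049, 0.064),
    "PC03": (-0.1173, -0.132, -0.102),
    "actv_grp[T.FSM]": (-0.8886, -1.346, -0.431),
}

ratios = {name: rate_ratio(*vals) for name, vals in rows.items()}
for name, (rr, lo, hi) in ratios.items():
    print(f"{name:18s} rate ratio = {rr:.3f}  (95% CI {lo:.3f} to {hi:.3f})")
```

A ratio above 1 means the expected number of correct answers rises with the feature, and a ratio below 1 means it falls. The `sid` terms with coefficients near -12 and standard errors near 119 (for example sid 33 and 57) would exponentiate to essentially zero; a plausible reading, not verified here, is that those students have no correct final answers in the data.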
The model with the sid and actv_grp variables and the PC features is the best model
pc_df_to_model.final_events.mean()
1.376022913256956
pc_df_to_model.final_events.var()
2.3829041926797476
model_list[2].fittedvalues
0 1.127279
1 0.879788
2 0.913107
3 0.828572
4 1.117086
...
2439 -1.544863
2440 -1.581286
2441 -1.119468
2442 -1.119786
2443 -1.420777
Length: 2444, dtype: float64
np.exp(model_list[2].fittedvalues)
0 3.087246
1 2.410388
2 2.492053
3 2.290045
4 3.055937
...
2439 0.213341
2440 0.205710
2441 0.326453
2442 0.326350
2443 0.241526
Length: 2444, dtype: float64
df02 = pc_df_to_model.copy()
df02['avg_count'] = np.exp( pc_model_list[2].fittedvalues )
df02.head()
| | sess | sid | actv_grp | final_events | final_trials | PC01 | PC02 | PC03 | PC04 | PC05 | ... | PC69 | PC70 | PC71 | PC72 | PC73 | PC74 | PC75 | PC76 | PC77 | avg_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Aulaweb | 2.0 | 2.0 | -1.343494 | 0.614381 | 3.048655 | -2.980032 | -0.717602 | ... | 0.007764 | -0.042438 | 0.007726 | 0.016882 | -0.033860 | -0.020212 | -0.020811 | 0.002340 | 0.013947 | 1.782928 |
| 1 | 1 | 1 | Blank | 2.0 | 2.0 | 2.264423 | -0.252348 | 3.635242 | -1.695365 | -0.567017 | ... | 0.016730 | 0.010136 | -0.022828 | -0.049764 | -0.025509 | -0.011965 | -0.003058 | 0.010145 | -0.003176 | 2.297561 |
| 2 | 1 | 1 | Deeds | 2.0 | 2.0 | 2.407197 | -0.285384 | 3.514516 | -1.835526 | -0.700168 | ... | -0.002833 | -0.015106 | 0.000015 | -0.020132 | -0.009533 | -0.014708 | 0.001780 | 0.028581 | -0.000724 | 2.392456 |
| 3 | 1 | 1 | Diagram | 2.0 | 2.0 | 1.800267 | -0.177009 | 3.836746 | -0.226538 | 0.360890 | ... | 0.019233 | -0.019455 | -0.021534 | -0.068026 | -0.055084 | 0.009613 | 0.001835 | -0.002256 | -0.030537 | 1.808362 |
| 4 | 1 | 1 | Other | 2.0 | 2.0 | 2.285621 | -0.236315 | 3.690869 | -1.931407 | -0.768909 | ... | -0.003566 | 0.041341 | -0.008955 | -0.027152 | -0.000594 | -0.012622 | 0.027929 | 0.006442 | -0.031819 | 2.362641 |
5 rows × 83 columns
df02['t'] = ( (df02.final_events - df02.avg_count)**2 - df02.avg_count ) / df02.avg_count
aux_mod = smf.ols( 't ~ avg_count - 1', data = df02).fit()
print( aux_mod.summary() )
OLS Regression Results
=======================================================================================
Dep. Variable: t R-squared (uncentered): 0.002
Model: OLS Adj. R-squared (uncentered): 0.002
Method: Least Squares F-statistic: 4.881
Date: Thu, 27 Apr 2023 Prob (F-statistic): 0.0272
Time: 07:52:57 Log-Likelihood: -4813.7
No. Observations: 2444 AIC: 9629.
Df Residuals: 2443 BIC: 9635.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
avg_count -0.0439 0.020 -2.209 0.027 -0.083 -0.005
==============================================================================
Omnibus: 4162.344 Durbin-Watson: 0.966
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6897299.670
Skew: 11.151 Prob(JB): 0.00
Kurtosis: 262.295 Cond. No. 1.00
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
my_coefplot( aux_mod )
Model 02 is slightly underdispersed. The auxiliary model's slope is negative (-0.0439, 95% CI -0.083 to -0.005) rather than significantly positive, and both the estimate and its error bar are small in magnitude. There is therefore no evidence of overdispersion, and the Poisson regression assumption that the variance equals the mean is reasonable in this case.
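
The auxiliary regression used above has a closed form, which makes its sign easy to reason about. The sketch below is a dependency-free illustration on toy numbers, not a replacement for the `smf.ols` fit. It computes the same statistic `t = ((y - mu)**2 - mu) / mu` and the no-intercept least-squares slope of `t` on `mu`; the function name and the toy inputs are hypothetical.

```python
def dispersion_slope(y, mu):
    """No-intercept OLS slope of t on mu, with t = ((y - mu)**2 - mu) / mu.

    A slope near 0 is consistent with variance = mean, a positive slope
    suggests overdispersion, and a negative slope suggests underdispersion.
    """
    t = [((yi - mi) ** 2 - mi) / mi for yi, mi in zip(y, mu)]
    numerator = sum(mi * ti for mi, ti in zip(mu, t))
    denominator = sum(mi * mi for mi in mu)
    return numerator / denominator

# Toy cases (not the notebook's data):
equal = dispersion_slope(y=[6, 2], mu=[4, 4])        # squared error equals the mean
over = dispersion_slope(y=[4, 0, 4], mu=[2, 2, 2])   # squared error exceeds the mean
under = dispersion_slope(y=[2, 2], mu=[2, 2])        # no spread around the mean
print(equal, over, under)
```

The three toy cases give a zero, a positive, and a negative slope respectively, which matches the reading of the notebook's estimate of -0.0439 as mild underdispersion.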
dm00_y, dm00_X = dmatrices('final_events ~ sid', data=final_sqrt_df, return_type='dataframe')
dm01_y, dm01_X = dmatrices('final_events ~ sid + actv_grp', data=final_sqrt_df, return_type='dataframe')
dm02_y, dm02_X = dmatrices('final_events ~ sid + actv_grp + ' + num_features_str, data=final_sqrt_df, return_type='dataframe')
dm03_y, dm03_X = dmatrices('final_events ~ ' + num_features_str, data=final_sqrt_df, return_type='dataframe')
dm04_y, dm04_X = dmatrices('final_events ~ (' + num_features_str + ')**2', data=final_sqrt_df, return_type='dataframe')
dm00_X.head()
| | Intercept | sid[T.2] | sid[T.4] | sid[T.5] | sid[T.7] | sid[T.8] | sid[T.9] | sid[T.11] | sid[T.12] | sid[T.14] | ... | sid[T.91] | sid[T.92] | sid[T.94] | sid[T.95] | sid[T.99] | sid[T.101] | sid[T.102] | sid[T.103] | sid[T.104] | sid[T.106] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 62 columns
dm00_y.head()
| | final_events |
|---|---|
| 0 | 2.0 |
| 1 | 2.0 |
| 2 | 2.0 |
| 3 | 2.0 |
| 4 | 2.0 |
modNB00 = sm.GLM(dm00_y, dm00_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
modNB01 = sm.GLM(dm01_y, dm01_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
modNB02 = sm.GLM(dm02_y, dm02_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
modNB03 = sm.GLM(dm03_y, dm03_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
modNB_list = [modNB00,modNB01,modNB02,modNB03]
modNB_results = pd.DataFrame({'model_name': ['modNB00','modNB01','modNB02','modNB03'],
'AIC': [mod.aic for mod in modNB_list],
'BIC': [mod.bic for mod in modNB_list]})
/Users/lisaover/opt/anaconda3/envs/cmpinf2120/lib/python3.8/site-packages/statsmodels/genmod/generalized_linear_model.py:1799: FutureWarning: The bic value is computed using the deviance formula. After 0.13 this will change to the log-likelihood based formula. This change has no impact on the relative rank of models compared using BIC. You can directly access the log-likelihood version using the `bic_llf` attribute. You can suppress this message by calling statsmodels.genmod.generalized_linear_model.SET_USE_BIC_LLF with True to get the LLF-based version now or False to retainthe deviance version. warnings.warn(
modNB_results
| | model_name | AIC | BIC |
|---|---|---|---|
| 0 | modNB00 | 7622.925542 | -16595.505299 |
| 1 | modNB01 | 7594.774720 | -16565.642208 |
| 2 | modNB02 | 7162.815262 | -16550.894535 |
| 3 | modNB03 | 7480.925401 | -16644.683179 |
sns.relplot(data = modNB_results.melt(id_vars=['model_name']),
x='model_name',
y='value',
col='variable',
col_wrap=2,
facet_kws = {'sharey': False})
plt.show()
print(modNB02.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: final_events No. Observations: 2444
Model: GLM Df Residuals: 2295
Model Family: NegativeBinomial Df Model: 148
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -3432.4
Date: Thu, 27 Apr 2023 Deviance: 1353.3
Time: 07:52:58 Pearson chi2: 1.08e+03
No. Iterations: 24 Pseudo R-squ. (CS): 0.3464
Covariance Type: nonrobust
===============================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 0.1358 0.288 0.472 0.637 -0.428 0.699
sid[T.2] -1.0080 0.294 -3.430 0.001 -1.584 -0.432
sid[T.4] -1.6195 0.336 -4.814 0.000 -2.279 -0.960
sid[T.5] -0.6385 0.286 -2.233 0.026 -1.199 -0.078
sid[T.7] -0.5035 0.280 -1.798 0.072 -1.052 0.045
sid[T.8] -1.8584 0.387 -4.807 0.000 -2.616 -1.101
sid[T.9] -1.3715 0.329 -4.167 0.000 -2.017 -0.726
sid[T.11] -0.2581 0.277 -0.931 0.352 -0.801 0.285
sid[T.12] -1.1180 0.319 -3.505 0.000 -1.743 -0.493
sid[T.14] -0.2066 0.276 -0.748 0.454 -0.748 0.335
sid[T.19] -0.7361 0.308 -2.389 0.017 -1.340 -0.132
sid[T.20] 0.1670 0.270 0.618 0.537 -0.363 0.697
sid[T.22] -1.7593 0.403 -4.363 0.000 -2.550 -0.969
sid[T.24] -1.2135 0.304 -3.993 0.000 -1.809 -0.618
sid[T.25] -0.7285 0.351 -2.077 0.038 -1.416 -0.041
sid[T.30] -0.5420 0.291 -1.860 0.063 -1.113 0.029
sid[T.33] -25.1326 4.14e+04 -0.001 1.000 -8.12e+04 8.11e+04
sid[T.34] -1.1691 0.308 -3.798 0.000 -1.772 -0.566
sid[T.37] -1.3000 0.361 -3.604 0.000 -2.007 -0.593
sid[T.38] -0.9761 0.296 -3.298 0.001 -1.556 -0.396
sid[T.39] -0.5863 0.323 -1.815 0.070 -1.219 0.047
sid[T.42] -1.8335 0.342 -5.368 0.000 -2.503 -1.164
sid[T.44] 0.8420 0.319 2.638 0.008 0.216 1.468
sid[T.45] 0.0189 0.337 0.056 0.955 -0.642 0.680
sid[T.46] -1.6314 0.398 -4.096 0.000 -2.412 -0.851
sid[T.47] -1.5105 0.333 -4.531 0.000 -2.164 -0.857
sid[T.49] -1.4406 0.303 -4.756 0.000 -2.034 -0.847
sid[T.51] -1.5684 0.323 -4.853 0.000 -2.202 -0.935
sid[T.52] -1.6853 0.323 -5.220 0.000 -2.318 -1.053
sid[T.54] -0.9987 0.299 -3.340 0.001 -1.585 -0.413
sid[T.55] 0.0110 0.303 0.036 0.971 -0.583 0.605
sid[T.56] -0.3047 0.285 -1.068 0.286 -0.864 0.255
sid[T.57] -24.5318 2.91e+04 -0.001 0.999 -5.7e+04 5.69e+04
sid[T.58] -1.0060 0.635 -1.583 0.113 -2.251 0.239
sid[T.59] -1.5392 0.327 -4.705 0.000 -2.180 -0.898
sid[T.60] -25.1975 2.23e+04 -0.001 0.999 -4.38e+04 4.37e+04
sid[T.61] -0.6115 0.303 -2.015 0.044 -1.206 -0.017
sid[T.62] -1.0567 0.424 -2.491 0.013 -1.888 -0.225
sid[T.64] -25.5006 2.36e+04 -0.001 0.999 -4.64e+04 4.63e+04
sid[T.67] -0.3932 0.325 -1.211 0.226 -1.029 0.243
sid[T.68] -0.0400 0.279 -0.143 0.886 -0.587 0.507
sid[T.69] -0.8165 0.337 -2.425 0.015 -1.476 -0.157
sid[T.70] -0.5305 0.303 -1.750 0.080 -1.125 0.064
sid[T.71] -0.5808 0.319 -1.818 0.069 -1.207 0.045
sid[T.73] -1.2784 0.351 -3.641 0.000 -1.967 -0.590
sid[T.75] -0.0387 0.300 -0.129 0.897 -0.627 0.549
sid[T.77] -0.6072 0.633 -0.959 0.338 -1.848 0.634
sid[T.79] -0.7467 0.294 -2.536 0.011 -1.324 -0.170
sid[T.80] -1.2728 0.319 -3.991 0.000 -1.898 -0.648
sid[T.82] -2.1508 0.330 -6.511 0.000 -2.798 -1.503
sid[T.83] -1.9852 0.346 -5.732 0.000 -2.664 -1.306
sid[T.87] -0.5944 0.283 -2.102 0.036 -1.149 -0.040
sid[T.91] -1.5260 0.313 -4.871 0.000 -2.140 -0.912
sid[T.92] -0.8436 0.306 -2.756 0.006 -1.444 -0.244
sid[T.94] -0.4750 0.278 -1.706 0.088 -1.021 0.071
sid[T.95] -1.2901 0.318 -4.060 0.000 -1.913 -0.667
sid[T.99] -1.6616 0.373 -4.457 0.000 -2.392 -0.931
sid[T.101] -1.5112 0.343 -4.403 0.000 -2.184 -0.839
sid[T.102] -1.3697 0.337 -4.066 0.000 -2.030 -0.709
sid[T.103] -26.0552 2.57e+04 -0.001 0.999 -5.05e+04 5.04e+04
sid[T.104] -1.2339 0.445 -2.776 0.006 -2.105 -0.363
sid[T.106] 0.4919 0.529 0.930 0.352 -0.544 1.528
actv_grp[T.Blank] 0.0502 0.128 0.393 0.694 -0.200 0.301
actv_grp[T.Deeds] 0.0282 0.128 0.221 0.825 -0.222 0.279
actv_grp[T.Diagram] -0.0721 0.128 -0.565 0.572 -0.322 0.178
actv_grp[T.FSM] -0.5637 0.319 -1.766 0.077 -1.189 0.062
actv_grp[T.FSM_Related] -0.5141 0.263 -1.958 0.050 -1.029 0.001
actv_grp[T.Other] 0.0708 0.131 0.539 0.590 -0.187 0.328
actv_grp[T.Properties] -0.0540 0.125 -0.431 0.666 -0.300 0.192
actv_grp[T.Study] 0.0734 0.130 0.566 0.571 -0.181 0.327
actv_grp[T.Study_Materials] 0.0167 0.296 0.057 0.955 -0.563 0.597
actv_grp[T.TextEditor] -0.0060 0.128 -0.047 0.963 -0.257 0.245
total_ms_tp000_sqrt 0.0002 0.000 0.838 0.402 -0.000 0.001
mw_tp000_sqrt 0.0049 0.008 0.624 0.533 -0.011 0.020
mwc_tp000_sqrt -0.0453 0.081 -0.557 0.578 -0.205 0.114
mcl_tp000_sqrt -0.0160 0.018 -0.894 0.371 -0.051 0.019
mcr_tp000_sqrt -0.0188 0.021 -0.896 0.370 -0.060 0.022
mm_tp000_sqrt 0.0025 0.002 1.284 0.199 -0.001 0.006
ks_tp000_sqrt -0.0067 0.006 -1.134 0.257 -0.018 0.005
total_ms_tp010_sqrt 0.0005 0.000 1.406 0.160 -0.000 0.001
mw_tp010_sqrt 0.0010 0.010 0.094 0.925 -0.019 0.021
mwc_tp010_sqrt 0.0248 0.102 0.244 0.807 -0.175 0.224
mcl_tp010_sqrt 0.0293 0.024 1.240 0.215 -0.017 0.076
mcr_tp010_sqrt -0.0091 0.025 -0.367 0.713 -0.058 0.039
mm_tp010_sqrt -0.0068 0.003 -2.313 0.021 -0.012 -0.001
ks_tp010_sqrt 0.0030 0.007 0.409 0.683 -0.011 0.017
total_ms_tp020_sqrt 0.0001 0.000 0.302 0.762 -0.001 0.001
mw_tp020_sqrt -0.0100 0.014 -0.716 0.474 -0.037 0.017
mwc_tp020_sqrt 0.0252 0.134 0.188 0.851 -0.238 0.288
mcl_tp020_sqrt -0.0418 0.031 -1.366 0.172 -0.102 0.018
mcr_tp020_sqrt 0.0013 0.032 0.042 0.966 -0.061 0.063
mm_tp020_sqrt 0.0065 0.004 1.607 0.108 -0.001 0.014
ks_tp020_sqrt -0.0090 0.009 -0.953 0.341 -0.028 0.010
total_ms_tp030_sqrt 0.0003 0.001 0.629 0.530 -0.001 0.001
mw_tp030_sqrt 0.0140 0.016 0.900 0.368 -0.017 0.045
mwc_tp030_sqrt 0.0117 0.140 0.084 0.933 -0.263 0.287
mcl_tp030_sqrt 0.0292 0.037 0.786 0.432 -0.044 0.102
mcr_tp030_sqrt 0.0975 0.040 2.436 0.015 0.019 0.176
mm_tp030_sqrt -0.0069 0.005 -1.433 0.152 -0.016 0.003
ks_tp030_sqrt -0.0087 0.012 -0.716 0.474 -0.033 0.015
total_ms_tp040_sqrt 0.0007 0.001 1.172 0.241 -0.000 0.002
mw_tp040_sqrt -0.0135 0.016 -0.838 0.402 -0.045 0.018
mwc_tp040_sqrt 0.0211 0.122 0.172 0.863 -0.219 0.261
mcl_tp040_sqrt -0.0314 0.041 -0.760 0.447 -0.112 0.050
mcr_tp040_sqrt 0.0365 0.046 0.797 0.426 -0.053 0.126
mm_tp040_sqrt 0.0006 0.005 0.104 0.918 -0.010 0.011
ks_tp040_sqrt -0.0177 0.015 -1.205 0.228 -0.046 0.011
total_ms_tp050_sqrt 0.0009 0.001 1.408 0.159 -0.000 0.002
mw_tp050_sqrt 0.0400 0.021 1.929 0.054 -0.001 0.081
mwc_tp050_sqrt -0.1063 0.150 -0.708 0.479 -0.401 0.188
mcl_tp050_sqrt -0.0281 0.045 -0.631 0.528 -0.115 0.059
mcr_tp050_sqrt 0.0263 0.047 0.555 0.579 -0.066 0.119
mm_tp050_sqrt -0.0034 0.006 -0.577 0.564 -0.015 0.008
ks_tp050_sqrt 0.0049 0.015 0.326 0.745 -0.024 0.034
total_ms_tp060_sqrt -0.0008 0.001 -1.059 0.290 -0.002 0.001
mw_tp060_sqrt -0.0266 0.024 -1.133 0.257 -0.073 0.019
mwc_tp060_sqrt 0.2214 0.232 0.955 0.340 -0.233 0.676
mcl_tp060_sqrt -0.0343 0.047 -0.727 0.467 -0.127 0.058
mcr_tp060_sqrt -0.0076 0.056 -0.134 0.893 -0.118 0.103
mm_tp060_sqrt 0.0073 0.006 1.153 0.249 -0.005 0.020
ks_tp060_sqrt 0.0150 0.017 0.876 0.381 -0.019 0.049
total_ms_tp070_sqrt 0.0007 0.001 0.831 0.406 -0.001 0.002
mw_tp070_sqrt 0.0168 0.023 0.722 0.470 -0.029 0.062
mwc_tp070_sqrt -0.0605 0.265 -0.228 0.820 -0.581 0.460
mcl_tp070_sqrt 0.0406 0.048 0.841 0.400 -0.054 0.135
mcr_tp070_sqrt -0.1275 0.062 -2.062 0.039 -0.249 -0.006
mm_tp070_sqrt -0.0062 0.006 -0.956 0.339 -0.019 0.007
ks_tp070_sqrt -0.0105 0.019 -0.550 0.582 -0.048 0.027
total_ms_tp080_sqrt 0.0004 0.001 0.515 0.607 -0.001 0.002
mw_tp080_sqrt 0.0076 0.026 0.290 0.772 -0.044 0.059
mwc_tp080_sqrt -0.1163 0.310 -0.375 0.708 -0.724 0.491
mcl_tp080_sqrt -0.0105 0.052 -0.203 0.839 -0.112 0.091
mcr_tp080_sqrt -0.0517 0.071 -0.728 0.466 -0.191 0.087
mm_tp080_sqrt 0.0014 0.007 0.200 0.842 -0.013 0.015
ks_tp080_sqrt 0.0071 0.020 0.347 0.728 -0.033 0.047
total_ms_tp090_sqrt -0.0004 0.001 -0.476 0.634 -0.002 0.001
mw_tp090_sqrt -0.0260 0.026 -0.994 0.320 -0.077 0.025
mwc_tp090_sqrt -0.1285 0.283 -0.454 0.650 -0.683 0.426
mcl_tp090_sqrt 0.0290 0.050 0.576 0.565 -0.070 0.128
mcr_tp090_sqrt -0.0127 0.079 -0.161 0.872 -0.168 0.142
mm_tp090_sqrt -0.0016 0.007 -0.223 0.824 -0.015 0.012
ks_tp090_sqrt 0.0280 0.020 1.386 0.166 -0.012 0.068
total_ms_tp100_sqrt -0.0016 0.001 -2.897 0.004 -0.003 -0.001
mw_tp100_sqrt 0.0060 0.016 0.377 0.706 -0.025 0.037
mwc_tp100_sqrt 0.1975 0.189 1.045 0.296 -0.173 0.568
mcl_tp100_sqrt 0.0082 0.033 0.245 0.807 -0.057 0.074
mcr_tp100_sqrt 0.1447 0.055 2.612 0.009 0.036 0.253
mm_tp100_sqrt 0.0029 0.005 0.627 0.531 -0.006 0.012
ks_tp100_sqrt -0.0007 0.012 -0.057 0.955 -0.025 0.024
===============================================================================================
my_coefplot(modNB02)
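
The negative BIC values in the modNB_results table are explained by the FutureWarning: the `bic` attribute of these GLM results is computed from the deviance rather than from the log-likelihood. The sketch below reproduces both versions for modNB02 from the rounded figures in its summary (deviance 1353.3, Df Residuals 2295, log-likelihood -3432.4, Df Model 148 plus the intercept). The function names are hypothetical, and the log-likelihood formula is the usual `k*ln(n) - 2*LL` definition rather than something confirmed against the notebook's environment.

```python
import math

def bic_from_deviance(deviance, df_resid, n_obs):
    """Deviance-based BIC: deviance - df_resid * ln(n)."""
    return deviance - df_resid * math.log(n_obs)

def bic_from_llf(llf, n_params, n_obs):
    """Log-likelihood-based BIC: k * ln(n) - 2 * LL."""
    return n_params * math.log(n_obs) - 2 * llf

def aic_from_llf(llf, n_params):
    """AIC = 2k - 2 * LL."""
    return 2 * n_params - 2 * llf

n_obs, n_params = 2444, 149   # 148 model degrees of freedom plus the intercept
bic_dev = bic_from_deviance(deviance=1353.3, df_resid=2295, n_obs=n_obs)
bic_ll = bic_from_llf(llf=-3432.4, n_params=n_params, n_obs=n_obs)
aic = aic_from_llf(llf=-3432.4, n_params=n_params)
print(f"deviance BIC = {bic_dev:.1f}, log-likelihood BIC = {bic_ll:.1f}, AIC = {aic:.1f}")
```

The deviance-based value matches the table entry for modNB02, so the BIC column there is not on the same scale as the Poisson BIC values reported earlier. As the warning notes, the log-likelihood version is available through the `bic_llf` attribute of each fitted result.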
pc_dm00_y, pc_dm00_X = dmatrices('final_events ~ sid', data=pc_df_to_model, return_type='dataframe')
pc_dm01_y, pc_dm01_X = dmatrices('final_events ~ sid + actv_grp', data=pc_df_to_model, return_type='dataframe')
pc_dm02_y, pc_dm02_X = dmatrices('final_events ~ sid + actv_grp + ' + pc_features_str, data=pc_df_to_model, return_type='dataframe')
pc_dm03_y, pc_dm03_X = dmatrices('final_events ~ ' + pc_features_str, data=pc_df_to_model, return_type='dataframe')
pc_dm04_y, pc_dm04_X = dmatrices('final_events ~ (' + pc_features_str + ')**2', data=pc_df_to_model, return_type='dataframe')
pc_dm05_y, pc_dm05_X = dmatrices('final_events ~ actv_grp +' + pc_features_str, data=pc_df_to_model, return_type='dataframe')
pc_dm06_y, pc_dm06_X = dmatrices('final_events ~ actv_grp * (' + pc_features_str + ')', data=pc_df_to_model, return_type='dataframe')
pc_dm07_y, pc_dm07_X = dmatrices('final_events ~ actv_grp + (' + pc_features_str + ')**2', data=pc_df_to_model, return_type='dataframe')
pc_modNB00 = sm.GLM(pc_dm00_y, pc_dm00_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB01 = sm.GLM(pc_dm01_y, pc_dm01_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB02 = sm.GLM(pc_dm02_y, pc_dm02_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB03 = sm.GLM(pc_dm03_y, pc_dm03_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB04 = sm.GLM(pc_dm04_y, pc_dm04_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB05 = sm.GLM(pc_dm05_y, pc_dm05_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB06 = sm.GLM(pc_dm06_y, pc_dm06_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB07 = sm.GLM(pc_dm07_y, pc_dm07_X, family=sm.families.NegativeBinomial()).fit(loglike_method='nb2', bic_llf=True)
pc_modNB_list = [pc_modNB00,pc_modNB01,pc_modNB02,pc_modNB03,pc_modNB04,pc_modNB05,pc_modNB06,pc_modNB07]
pc_modNB_results = pd.DataFrame({'model_name': ['pc_modNB00','pc_modNB01','pc_modNB02','pc_modNB03','pc_modNB04','pc_modNB05','pc_modNB06','pc_modNB07'],
'AIC': [mod.aic for mod in pc_modNB_list],
'BIC': [mod.bic for mod in pc_modNB_list]})
pc_modNB_results
| | model_name | AIC | BIC |
|---|---|---|---|
| 0 | pc_modNB00 | 7622.925542 | -16595.505299 |
| 1 | pc_modNB01 | 7594.774720 | -16565.642208 |
| 2 | pc_modNB02 | 7203.795463 | -16910.210334 |
| 3 | pc_modNB03 | 7614.470712 | -16911.433869 |
| 4 | pc_modNB04 | 7561.898320 | -16801.567304 |
| 5 | pc_modNB05 | 7604.852347 | -16863.038321 |
| 6 | pc_modNB06 | 7690.765764 | -16313.013598 |
| 7 | pc_modNB07 | 7563.870430 | -16741.581281 |
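
With AIC as the metric, the additive `sid + actv_grp + numeric features` model can be compared across the fits reported in this section. The sketch below only ranks AIC values copied from the tables above; it does not refit anything, and the dictionary keys simply reuse the notebook's model names.

```python
# AIC values copied from the result tables in this section.
aic_by_model = {
    "pc_mod02 (Poisson, PCs)": 6637.769417,
    "modNB02 (NegBin, sqrt inputs)": 7162.815262,
    "pc_modNB02 (NegBin, PCs)": 7203.795463,
}

# Lowest AIC wins; report each model's gap to the best one.
best_name = min(aic_by_model, key=aic_by_model.get)
delta_aic = {name: value - aic_by_model[best_name] for name, value in aic_by_model.items()}

for name, delta in sorted(delta_aic.items(), key=lambda item: item[1]):
    print(f"{name:32s} delta AIC = {delta:8.2f}")
```

One caveat when reading the gap: `sm.families.NegativeBinomial()` is used here with its default fixed `alpha` of 1.0, so the dispersion is not estimated in these Negative Binomial fits. Given the mild underdispersion found earlier, a fixed `alpha` of 1 is likely too large, which may account for part of the difference.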
sns.relplot(data = pc_modNB_results.melt(id_vars=['model_name']),
x='model_name',
y='value',
col='variable',
col_wrap=2,
facet_kws = {'sharey': False},
height=5, aspect=2)
plt.show()
print(pc_modNB02.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: final_events No. Observations: 2444
Model: GLM Df Residuals: 2364
Model Family: NegativeBinomial Df Model: 79
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -3521.9
Date: Thu, 27 Apr 2023 Deviance: 1532.3
Time: 07:53:00 Pearson chi2: 1.20e+03
No. Iterations: 24 Pseudo R-squ. (CS): 0.2967
Covariance Type: nonrobust
===============================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 1.0430 0.208 5.004 0.000 0.635 1.452
sid[T.2] -0.8497 0.275 -3.090 0.002 -1.389 -0.311
sid[T.4] -1.7163 0.308 -5.572 0.000 -2.320 -1.113
sid[T.5] -0.2894 0.272 -1.066 0.287 -0.822 0.243
sid[T.7] -0.5547 0.265 -2.094 0.036 -1.074 -0.035
sid[T.8] -1.9066 0.362 -5.263 0.000 -2.617 -1.197
sid[T.9] -1.2008 0.304 -3.952 0.000 -1.796 -0.605
sid[T.11] -0.4263 0.267 -1.598 0.110 -0.949 0.097
sid[T.12] -1.1423 0.292 -3.911 0.000 -1.715 -0.570
sid[T.14] -0.1230 0.265 -0.465 0.642 -0.642 0.396
sid[T.19] -0.5337 0.296 -1.802 0.072 -1.114 0.047
sid[T.20] 0.3042 0.255 1.195 0.232 -0.195 0.803
sid[T.22] -1.9202 0.371 -5.170 0.000 -2.648 -1.192
sid[T.24] -0.8558 0.285 -3.001 0.003 -1.415 -0.297
sid[T.25] -0.9986 0.330 -3.024 0.002 -1.646 -0.351
sid[T.30] -0.2159 0.271 -0.796 0.426 -0.747 0.315
sid[T.33] -24.9732 4.2e+04 -0.001 1.000 -8.23e+04 8.23e+04
sid[T.34] -1.3095 0.289 -4.526 0.000 -1.877 -0.742
sid[T.37] -1.0763 0.350 -3.074 0.002 -1.762 -0.390
sid[T.38] -0.9661 0.281 -3.442 0.001 -1.516 -0.416
sid[T.39] -0.5894 0.286 -2.064 0.039 -1.149 -0.030
sid[T.42] -1.8109 0.303 -5.980 0.000 -2.404 -1.217
sid[T.44] 1.2965 0.279 4.651 0.000 0.750 1.843
sid[T.45] 0.1783 0.320 0.557 0.577 -0.449 0.806
sid[T.46] -1.1605 0.369 -3.145 0.002 -1.884 -0.437
sid[T.47] -1.4276 0.307 -4.654 0.000 -2.029 -0.826
sid[T.49] -1.2740 0.287 -4.432 0.000 -1.837 -0.711
sid[T.51] -1.5121 0.302 -5.008 0.000 -2.104 -0.920
sid[T.52] -1.5999 0.306 -5.225 0.000 -2.200 -1.000
sid[T.54] -1.1983 0.284 -4.216 0.000 -1.755 -0.641
sid[T.55] -0.1298 0.295 -0.440 0.660 -0.709 0.449
sid[T.56] -0.2980 0.267 -1.116 0.264 -0.821 0.225
sid[T.57] -24.8433 2.93e+04 -0.001 0.999 -5.74e+04 5.73e+04
sid[T.58] -0.7900 0.613 -1.289 0.197 -1.991 0.411
sid[T.59] -1.5478 0.309 -5.014 0.000 -2.153 -0.943
sid[T.60] -25.0779 2.23e+04 -0.001 0.999 -4.38e+04 4.37e+04
sid[T.61] -0.6813 0.288 -2.363 0.018 -1.246 -0.116
sid[T.62] -0.8847 0.416 -2.128 0.033 -1.700 -0.070
sid[T.64] -25.4076 2.39e+04 -0.001 0.999 -4.68e+04 4.67e+04
sid[T.67] -0.1744 0.308 -0.566 0.572 -0.779 0.430
sid[T.68] -0.1111 0.268 -0.414 0.679 -0.637 0.415
sid[T.69] -0.8471 0.317 -2.673 0.008 -1.468 -0.226
sid[T.70] -0.6750 0.287 -2.352 0.019 -1.237 -0.113
sid[T.71] -0.5233 0.307 -1.705 0.088 -1.125 0.078
sid[T.73] -1.0350 0.319 -3.240 0.001 -1.661 -0.409
sid[T.75] -0.0514 0.285 -0.180 0.857 -0.610 0.507
sid[T.77] -0.8524 0.611 -1.395 0.163 -2.050 0.346
sid[T.79] -0.8940 0.272 -3.289 0.001 -1.427 -0.361
sid[T.80] -1.2820 0.293 -4.376 0.000 -1.856 -0.708
sid[T.82] -1.9009 0.309 -6.150 0.000 -2.507 -1.295
sid[T.83] -1.9442 0.320 -6.084 0.000 -2.571 -1.318
sid[T.87] -0.5847 0.267 -2.188 0.029 -1.109 -0.061
sid[T.91] -1.4830 0.299 -4.967 0.000 -2.068 -0.898
sid[T.92] -0.6252 0.282 -2.218 0.027 -1.178 -0.073
sid[T.94] -0.4256 0.263 -1.617 0.106 -0.942 0.090
sid[T.95] -1.1891 0.296 -4.018 0.000 -1.769 -0.609
sid[T.99] -1.6806 0.342 -4.918 0.000 -2.350 -1.011
sid[T.101] -1.2669 0.323 -3.928 0.000 -1.899 -0.635
sid[T.102] -1.3029 0.311 -4.192 0.000 -1.912 -0.694
sid[T.103] -25.7103 2.66e+04 -0.001 0.999 -5.22e+04 5.21e+04
sid[T.104] -1.1449 0.419 -2.732 0.006 -1.966 -0.323
sid[T.106] 0.6389 0.442 1.446 0.148 -0.227 1.505
actv_grp[T.Blank] -0.0227 0.122 -0.186 0.852 -0.261 0.216
actv_grp[T.Deeds] -0.0472 0.120 -0.393 0.694 -0.282 0.188
actv_grp[T.Diagram] -0.2020 0.121 -1.666 0.096 -0.440 0.036
actv_grp[T.FSM] -0.9596 0.305 -3.144 0.002 -1.558 -0.361
actv_grp[T.FSM_Related] -0.7790 0.257 -3.033 0.002 -1.282 -0.276
actv_grp[T.Other] 0.0190 0.125 0.152 0.879 -0.226 0.264
actv_grp[T.Properties] -0.1445 0.121 -1.197 0.231 -0.381 0.092
actv_grp[T.Study] -0.0045 0.123 -0.036 0.971 -0.246 0.237
actv_grp[T.Study_Materials] -0.3177 0.288 -1.101 0.271 -0.883 0.248
actv_grp[T.TextEditor] -0.0649 0.120 -0.540 0.589 -0.300 0.171
PC01 0.0527 0.006 8.363 0.000 0.040 0.065
PC02 -0.0298 0.013 -2.283 0.022 -0.055 -0.004
PC03 -0.1282 0.012 -10.563 0.000 -0.152 -0.104
PC04 0.0394 0.014 2.824 0.005 0.012 0.067
PC05 -0.0303 0.015 -2.074 0.038 -0.059 -0.002
PC06 -0.1435 0.016 -8.893 0.000 -0.175 -0.112
PC07 0.1002 0.020 5.048 0.000 0.061 0.139
PC08 -0.0413 0.025 -1.641 0.101 -0.091 0.008
===============================================================================================
my_coefplot(pc_modNB02)
best_model = smf.poisson( formula = 'final_events ~ sid + actv_grp + ' + pc_features_str, data = pc_df_to_model).fit(method='ncg')
Optimization terminated successfully.
Current function value: 1.325239
Iterations: 22
Function evaluations: 24
Gradient evaluations: 24
Hessian evaluations: 22
print(best_model.summary())
Poisson Regression Results
==============================================================================
Dep. Variable: final_events No. Observations: 2444
Model: Poisson Df Residuals: 2364
Method: MLE Df Model: 79
Date: Thu, 27 Apr 2023 Pseudo R-squ.: 0.2280
Time: 07:53:01 Log-Likelihood: -3238.9
converged: True LL-Null: -4195.6
Covariance Type: nonrobust LLR p-value: 0.000
===============================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 0.8569 0.125 6.841 0.000 0.611 1.102
sid[T.2] -0.6452 0.171 -3.766 0.000 -0.981 -0.309
sid[T.4] -1.4606 0.199 -7.326 0.000 -1.851 -1.070
sid[T.5] -0.1323 0.163 -0.813 0.416 -0.451 0.187
sid[T.7] -0.4821 0.157 -3.071 0.002 -0.790 -0.174
sid[T.8] -1.5239 0.236 -6.462 0.000 -1.986 -1.062
sid[T.9] -0.8776 0.179 -4.896 0.000 -1.229 -0.526
sid[T.11] -0.2181 0.161 -1.356 0.175 -0.533 0.097
sid[T.12] -0.8716 0.183 -4.765 0.000 -1.230 -0.513
sid[T.14] -0.0160 0.154 -0.104 0.917 -0.318 0.286
sid[T.19] -0.4075 0.182 -2.245 0.025 -0.763 -0.052
sid[T.20] 0.3226 0.145 2.221 0.026 0.038 0.607
sid[T.22] -1.7517 0.278 -6.308 0.000 -2.296 -1.207
sid[T.24] -0.6711 0.170 -3.952 0.000 -1.004 -0.338
sid[T.25] -0.7196 0.187 -3.844 0.000 -1.086 -0.353
sid[T.30] -0.1450 0.165 -0.880 0.379 -0.468 0.178
sid[T.33] -12.0511 119.257 -0.101 0.920 -245.791 221.689
sid[T.34] -1.0135 0.178 -5.681 0.000 -1.363 -0.664
sid[T.37] -0.9214 0.251 -3.671 0.000 -1.413 -0.430
sid[T.38] -0.7803 0.171 -4.565 0.000 -1.115 -0.445
sid[T.39] -0.6693 0.170 -3.947 0.000 -1.002 -0.337
sid[T.42] -1.4754 0.192 -7.667 0.000 -1.853 -1.098
sid[T.44] 1.2138 0.161 7.533 0.000 0.898 1.530
sid[T.45] 0.2702 0.182 1.484 0.138 -0.087 0.627
sid[T.46] -1.0982 0.277 -3.962 0.000 -1.641 -0.555
sid[T.47] -1.1412 0.195 -5.853 0.000 -1.523 -0.759
sid[T.49] -1.1224 0.183 -6.138 0.000 -1.481 -0.764
sid[T.51] -1.4247 0.212 -6.708 0.000 -1.841 -1.008
sid[T.52] -1.4138 0.205 -6.898 0.000 -1.815 -1.012
sid[T.54] -0.9313 0.177 -5.261 0.000 -1.278 -0.584
sid[T.55] 0.0668 0.170 0.394 0.694 -0.266 0.399
sid[T.56] -0.1048 0.154 -0.680 0.496 -0.407 0.197
sid[T.57] -12.7017 119.257 -0.107 0.915 -246.442 221.039
sid[T.58] -0.6118 0.427 -1.434 0.152 -1.448 0.224
sid[T.59] -1.3104 0.212 -6.174 0.000 -1.726 -0.894
sid[T.60] -13.4938 119.257 -0.113 0.910 -247.234 220.247
sid[T.61] -0.5787 0.178 -3.246 0.001 -0.928 -0.229
sid[T.62] -0.6456 0.276 -2.335 0.020 -1.187 -0.104
sid[T.64] -13.6254 119.257 -0.114 0.909 -247.366 220.115
sid[T.67] -0.1626 0.180 -0.903 0.366 -0.515 0.190
sid[T.68] 0.0336 0.151 0.222 0.824 -0.263 0.330
sid[T.69] -0.7821 0.214 -3.649 0.000 -1.202 -0.362
sid[T.70] -0.7380 0.176 -4.188 0.000 -1.083 -0.393
sid[T.71] -0.4077 0.197 -2.072 0.038 -0.793 -0.022
sid[T.73] -0.8393 0.194 -4.335 0.000 -1.219 -0.460
sid[T.75] 0.0326 0.161 0.203 0.839 -0.283 0.348
sid[T.77] -0.6609 0.426 -1.552 0.121 -1.495 0.174
sid[T.79] -0.8551 0.165 -5.192 0.000 -1.178 -0.532
sid[T.80] -0.9316 0.180 -5.178 0.000 -1.284 -0.579
sid[T.82] -1.6979 0.212 -7.995 0.000 -2.114 -1.282
sid[T.83] -1.5917 0.203 -7.853 0.000 -1.989 -1.194
sid[T.87] -0.5131 0.160 -3.210 0.001 -0.826 -0.200
sid[T.91] -1.1428 0.182 -6.264 0.000 -1.500 -0.785
sid[T.92] -0.5063 0.181 -2.804 0.005 -0.860 -0.152
sid[T.94] -0.4604 0.155 -2.962 0.003 -0.765 -0.156
sid[T.95] -1.0420 0.185 -5.623 0.000 -1.405 -0.679
sid[T.99] -1.2638 0.212 -5.957 0.000 -1.680 -0.848
sid[T.101] -0.9256 0.197 -4.697 0.000 -1.312 -0.539
sid[T.102] -1.2173 0.200 -6.097 0.000 -1.609 -0.826
sid[T.103] -13.7455 119.257 -0.115 0.908 -247.486 219.995
sid[T.104] -0.8747 0.277 -3.154 0.002 -1.418 -0.331
sid[T.106] 0.7139 0.210 3.397 0.001 0.302 1.126
actv_grp[T.Blank] 0.0110 0.074 0.149 0.882 -0.134 0.156
actv_grp[T.Deeds] 0.0232 0.072 0.321 0.748 -0.119 0.165
actv_grp[T.Diagram] -0.1319 0.073 -1.816 0.069 -0.274 0.010
actv_grp[T.FSM] -0.8886 0.234 -3.805 0.000 -1.346 -0.431
actv_grp[T.FSM_Related] -0.6528 0.182 -3.578 0.000 -1.011 -0.295
actv_grp[T.Other] 0.0238 0.076 0.313 0.755 -0.125 0.173
actv_grp[T.Properties] -0.0863 0.073 -1.189 0.235 -0.229 0.056
actv_grp[T.Study] 0.0159 0.075 0.212 0.832 -0.131 0.163
actv_grp[T.Study_Materials] -0.2142 0.172 -1.245 0.213 -0.551 0.123
actv_grp[T.TextEditor] -0.0107 0.073 -0.147 0.883 -0.153 0.132
PC01 0.0564 0.004 15.066 0.000 0.049 0.064
PC02 -0.0306 0.008 -4.048 0.000 -0.045 -0.016
PC03 -0.1173 0.008 -15.385 0.000 -0.132 -0.102
PC04 0.0250 0.008 3.073 0.002 0.009 0.041
PC05 -0.0447 0.009 -5.207 0.000 -0.062 -0.028
PC06 -0.1237 0.010 -12.826 0.000 -0.143 -0.105
PC07 0.0997 0.012 8.191 0.000 0.076 0.124
PC08 -0.0274 0.015 -1.835 0.066 -0.057 0.002
===============================================================================================
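Since the models in this notebook are compared by AIC, the value for this Poisson fit can be recovered approximately from the summary above. The sketch below assumes the usual definition AIC = 2k - 2·logL, with k counted as Df Model plus one for the intercept. The log-likelihood is copied from the printed summary, so it is rounded to one decimal place; the fitted results object also exposes the value directly as best_model.aic.

```python
# Values copied from the Poisson summary above (log-likelihood is rounded).
log_likelihood = -3238.9
df_model = 79              # number of slope terms reported as "Df Model"
n_params = df_model + 1    # assumption: add one parameter for the intercept

# Standard AIC definition: 2k - 2*logL (lower is better).
aic = 2 * n_params - 2 * log_likelihood
print(f"Approximate AIC: {aic:.1f}")
```

The same arithmetic applied to the Negative Binomial summary needs one additional parameter for the dispersion term, which is why the two families should be compared on AIC rather than on log-likelihood alone.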
my_coefplot(best_model)
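Five student terms in the summary above (sid 33, 57, 60, 64, and 103) have coefficients between about -12 and -14 with a standard error of 119.257, so their confidence intervals span hundreds of units. One plausible explanation, not verified here, is that these students have no correctly answered questions in any row, which pushes the maximum likelihood estimate toward negative infinity. The sketch below flags such terms from a handful of rows copied from the summary; the cutoff SE_THRESHOLD is a hypothetical choice. With the fitted model available, the same check could use best_model.params and best_model.bse instead of hand-copied values.

```python
import pandas as pd

# A few rows copied from the Poisson summary above (coefficient, standard error).
coef_table = pd.DataFrame(
    {
        "coef": [-0.6452, -12.0511, -12.7017, -13.4938, -13.6254, -13.7455, 0.0564],
        "std_err": [0.171, 119.257, 119.257, 119.257, 119.257, 119.257, 0.004],
    },
    index=["sid[T.2]", "sid[T.33]", "sid[T.57]", "sid[T.60]",
           "sid[T.64]", "sid[T.103]", "PC01"],
)

# Hypothetical cutoff: a standard error this large on the log scale
# means the coefficient is effectively not identified by the data.
SE_THRESHOLD = 10.0

unstable = coef_table[coef_table["std_err"] > SE_THRESHOLD]
print(unstable)
```

Terms flagged this way are best read as "not estimable" rather than as evidence of no effect, and they stretch the axis of any coefficient plot that includes them.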
pc_df_to_model.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2444 entries, 0 to 2443
Data columns (total 82 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sess          2444 non-null   object
 1   sid           2444 non-null   object
 2   actv_grp      2444 non-null   object
 3   final_events  2444 non-null   float64
 4   final_trials  2444 non-null   float64
 5   PC01          2444 non-null   float64
 6   PC02          2444 non-null   float64
 7   PC03          2444 non-null   float64
 8   PC04          2444 non-null   float64
 9   PC05          2444 non-null   float64
 10  PC06          2444 non-null   float64
 11  PC07          2444 non-null   float64
 12  PC08          2444 non-null   float64
 13  PC09          2444 non-null   float64
 14  PC10          2444 non-null   float64
 15  PC11          2444 non-null   float64
 16  PC12          2444 non-null   float64
 17  PC13          2444 non-null   float64
 18  PC14          2444 non-null   float64
 19  PC15          2444 non-null   float64
 20  PC16          2444 non-null   float64
 21  PC17          2444 non-null   float64
 22  PC18          2444 non-null   float64
 23  PC19          2444 non-null   float64
 24  PC20          2444 non-null   float64
 25  PC21          2444 non-null   float64
 26  PC22          2444 non-null   float64
 27  PC23          2444 non-null   float64
 28  PC24          2444 non-null   float64
 29  PC25          2444 non-null   float64
 30  PC26          2444 non-null   float64
 31  PC27          2444 non-null   float64
 32  PC28          2444 non-null   float64
 33  PC29          2444 non-null   float64
 34  PC30          2444 non-null   float64
 35  PC31          2444 non-null   float64
 36  PC32          2444 non-null   float64
 37  PC33          2444 non-null   float64
 38  PC34          2444 non-null   float64
 39  PC35          2444 non-null   float64
 40  PC36          2444 non-null   float64
 41  PC37          2444 non-null   float64
 42  PC38          2444 non-null   float64
 43  PC39          2444 non-null   float64
 44  PC40          2444 non-null   float64
 45  PC41          2444 non-null   float64
 46  PC42          2444 non-null   float64
 47  PC43          2444 non-null   float64
 48  PC44          2444 non-null   float64
 49  PC45          2444 non-null   float64
 50  PC46          2444 non-null   float64
 51  PC47          2444 non-null   float64
 52  PC48          2444 non-null   float64
 53  PC49          2444 non-null   float64
 54  PC50          2444 non-null   float64
 55  PC51          2444 non-null   float64
 56  PC52          2444 non-null   float64
 57  PC53          2444 non-null   float64
 58  PC54          2444 non-null   float64
 59  PC55          2444 non-null   float64
 60  PC56          2444 non-null   float64
 61  PC57          2444 non-null   float64
 62  PC58          2444 non-null   float64
 63  PC59          2444 non-null   float64
 64  PC60          2444 non-null   float64
 65  PC61          2444 non-null   float64
 66  PC62          2444 non-null   float64
 67  PC63          2444 non-null   float64
 68  PC64          2444 non-null   float64
 69  PC65          2444 non-null   float64
 70  PC66          2444 non-null   float64
 71  PC67          2444 non-null   float64
 72  PC68          2444 non-null   float64
 73  PC69          2444 non-null   float64
 74  PC70          2444 non-null   float64
 75  PC71          2444 non-null   float64
 76  PC72          2444 non-null   float64
 77  PC73          2444 non-null   float64
 78  PC74          2444 non-null   float64
 79  PC75          2444 non-null   float64
 80  PC76          2444 non-null   float64
 81  PC77          2444 non-null   float64
dtypes: float64(79), object(3)
memory usage: 1.5+ MB
pc_df_to_model.head()
| | sess | sid | actv_grp | final_events | final_trials | PC01 | PC02 | PC03 | PC04 | PC05 | ... | PC68 | PC69 | PC70 | PC71 | PC72 | PC73 | PC74 | PC75 | PC76 | PC77 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Aulaweb | 2.0 | 2.0 | -1.343494 | 0.614381 | 3.048655 | -2.980032 | -0.717602 | ... | 0.021226 | 0.007764 | -0.042438 | 0.007726 | 0.016882 | -0.033860 | -0.020212 | -0.020811 | 0.002340 | 0.013947 |
| 1 | 1 | 1 | Blank | 2.0 | 2.0 | 2.264423 | -0.252348 | 3.635242 | -1.695365 | -0.567017 | ... | -0.054488 | 0.016730 | 0.010136 | -0.022828 | -0.049764 | -0.025509 | -0.011965 | -0.003058 | 0.010145 | -0.003176 |
| 2 | 1 | 1 | Deeds | 2.0 | 2.0 | 2.407197 | -0.285384 | 3.514516 | -1.835526 | -0.700168 | ... | -0.008969 | -0.002833 | -0.015106 | 0.000015 | -0.020132 | -0.009533 | -0.014708 | 0.001780 | 0.028581 | -0.000724 |
| 3 | 1 | 1 | Diagram | 2.0 | 2.0 | 1.800267 | -0.177009 | 3.836746 | -0.226538 | 0.360890 | ... | -0.052102 | 0.019233 | -0.019455 | -0.021534 | -0.068026 | -0.055084 | 0.009613 | 0.001835 | -0.002256 | -0.030537 |
| 4 | 1 | 1 | Other | 2.0 | 2.0 | 2.285621 | -0.236315 | 3.690869 | -1.931407 | -0.768909 | ... | -0.031325 | -0.003566 | 0.041341 | -0.008955 | -0.027152 | -0.000594 | -0.012622 | 0.027929 | 0.006442 | -0.031819 |
5 rows × 82 columns
pc_df_to_model.PC01.min()
-16.59044209268085
pc_df_to_model.PC01.max()
25.04659416706056
pc_df_to_model.sid.unique()
array([1, 2, 4, 5, 7, 9, 11, 12, 14, 19, 20, 22, 30, 34, 37, 38, 39, 42,
44, 46, 47, 49, 51, 52, 54, 55, 56, 59, 62, 67, 68, 70, 71, 73, 79,
80, 82, 87, 91, 92, 94, 101, 102, 104, 8, 24, 61, 83, 95, 99, 103,
25, 45, 69, 75, 106, 33, 57, 58, 60, 64, 77], dtype=object)
# Prediction grid: sweep PC01 over its observed range (padded by 0.02 on each side),
# hold PC02-PC08 at 0, and cross with selected students and the activity groups in actv_subgrp_1.
input_grid_pc01 = pd.DataFrame(
    [(xa, xb, xc, xd, xe, xf, xg, xh, xi, xj)
     for xa in np.linspace(pc_df_to_model.PC01.min() - 0.02, pc_df_to_model.PC01.max() + 0.02, num=101)
     for xb in [0.]
     for xc in [0.]
     for xd in [0.]
     for xe in [0.]
     for xf in [0.]
     for xg in [0.]
     for xh in [0.]
     # for xi in pc_df_to_model.sid.unique()  # alternative: every student
     for xi in [5, 14, 20, 44, 87, 94, 102, 106]
     for xj in actv_subgrp_1],
    columns=['PC01', 'PC02', 'PC03', 'PC04', 'PC05', 'PC06', 'PC07', 'PC08', 'sid', 'actv_grp'])
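The nested comprehension above is a Cartesian product in which only PC01, sid, and actv_grp actually vary. As an alternative sketch, the same grid can be built with itertools.product and the constant columns filled in afterwards. The list activity_groups below is a hypothetical stand-in for actv_subgrp_1, whose contents are defined elsewhere in the notebook; six entries are assumed because the grid reported below has 4848 = 101 × 8 × 6 rows.

```python
from itertools import product

import numpy as np
import pandas as pd

# Observed PC01 range, copied from the min()/max() outputs above.
pc01_min, pc01_max = -16.59044209268085, 25.04659416706056
pc01_values = np.linspace(pc01_min - 0.02, pc01_max + 0.02, num=101)

student_ids = [5, 14, 20, 44, 87, 94, 102, 106]

# Hypothetical stand-in for actv_subgrp_1 (its real contents are not shown here).
activity_groups = ["Aulaweb", "Blank", "Deeds", "Diagram", "Other", "Properties"]

# Cross the three varying inputs; the last factor varies fastest,
# matching the row order of the nested comprehension.
grid = pd.DataFrame(
    list(product(pc01_values, student_ids, activity_groups)),
    columns=["PC01", "sid", "actv_grp"],
)

# Hold the remaining principal components at 0.
for name in ["PC02", "PC03", "PC04", "PC05", "PC06", "PC07", "PC08"]:
    grid[name] = 0.0

print(grid.shape)
```

The column order differs from the original grid, which should not matter for a formula-based predict call because columns are looked up by name.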
input_grid_pc01.describe()
| | PC01 | PC02 | PC03 | PC04 | PC05 | PC06 | PC07 | PC08 | sid |
|---|---|---|---|---|---|---|---|---|---|
| count | 4848.000000 | 4848.0 | 4848.0 | 4848.0 | 4848.0 | 4848.0 | 4848.0 | 4848.0 | 4848.000000 |
| mean | 4.228076 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 59.000000 |
| std | 12.152093 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 39.932179 |
| min | -16.610442 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.000000 |
| 25% | -6.191183 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 18.500000 |
| 50% | 4.228076 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 65.500000 |
| 75% | 14.647335 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 96.000000 |
| max | 25.066594 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 106.000000 |
input_grid_pc01.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4848 entries, 0 to 4847
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   PC01      4848 non-null   float64
 1   PC02      4848 non-null   float64
 2   PC03      4848 non-null   float64
 3   PC04      4848 non-null   float64
 4   PC05      4848 non-null   float64
 5   PC06      4848 non-null   float64
 6   PC07      4848 non-null   float64
 7   PC08      4848 non-null   float64
 8   sid       4848 non-null   int64
 9   actv_grp  4848 non-null   object
dtypes: float64(8), int64(1), object(1)
memory usage: 378.9+ KB
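# Note: for a Poisson model, predict() returns the expected count (mean of final_events),
# not a probability; the column name 'pred_probability' is kept for consistency.
input_grid_pc01['pred_probability'] = best_model.predict(input_grid_pc01)
# Predicted count vs. PC01, one line per activity group (aggregated over the selected students).
sns.relplot(data = input_grid_pc01, x='PC01', y='pred_probability',
            hue='actv_grp', kind='line')
plt.show()
# Same trend, faceted by student.
sns.relplot(data = input_grid_pc01, x='PC01', y='pred_probability', hue='actv_grp',
            col='sid', col_wrap=2, kind='line')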
plt.show()
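Because the Poisson model uses a log link, the curves plotted above are exponential in PC01, and the PC01 coefficient is easiest to read as a multiplicative effect on the expected count. The sketch below uses the PC01 estimate and its 95% interval from the Poisson summary, together with the observed PC01 range; it assumes all other inputs are held fixed, as they are in the prediction grid.

```python
import numpy as np

# PC01 estimate and 95% confidence limits, copied from the Poisson summary.
pc01_coef = 0.0564
pc01_ci = (0.049, 0.064)

# Multiplicative change in the expected count per one-unit increase in PC01.
rate_ratio = np.exp(pc01_coef)
rate_ratio_ci = tuple(np.exp(limit) for limit in pc01_ci)

# Multiplicative change across the full observed PC01 range.
pc01_span = 25.04659416706056 - (-16.59044209268085)
fold_change = np.exp(pc01_coef * pc01_span)

print(f"Rate ratio per unit of PC01: {rate_ratio:.3f}")
print(f"95% interval: {rate_ratio_ci[0]:.3f} to {rate_ratio_ci[1]:.3f}")
print(f"Fold change over the observed PC01 range: {fold_change:.1f}")
```

A one-unit increase in PC01 corresponds to roughly a 6% higher expected count, which compounds to about a tenfold difference between the smallest and largest observed PC01 values for a given student and activity group.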